[00:00:44] or i can remove myself but not save that because "You no longer have permission to access this document, so your changes can't be saved. " hahah
[00:20:08] (CR) Dzahn: [C: 2] Add djvu tools for OCG. [puppet] - https://gerrit.wikimedia.org/r/165329 (owner: Cscott)
[00:25:37] (CR) Dzahn: "@ocg1001:~# /usr/bin/ddjvu --help" [puppet] - https://gerrit.wikimedia.org/r/165329 (owner: Cscott)
[00:44:54] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[01:05:59] (PS1) Dzahn: elasticsearch - delete pmtpa remnants [puppet] - https://gerrit.wikimedia.org/r/165672
[01:07:06] (PS1) Dzahn: facilities - remove Tampa power strip monitors [puppet] - https://gerrit.wikimedia.org/r/165673
[01:07:47] (PS2) Dzahn: elasticsearch - delete pmtpa remnants [puppet] - https://gerrit.wikimedia.org/r/165672
[01:10:16] (PS1) Dzahn: rancid - remove pmtpa devices from router.db [puppet] - https://gerrit.wikimedia.org/r/165674
[01:16:56] (PS1) Dzahn: remove pdf servers and role::pdf [puppet] - https://gerrit.wikimedia.org/r/165676
[01:18:32] (PS2) Dzahn: remove pdf servers,role::pdf and misc pdf class [puppet] - https://gerrit.wikimedia.org/r/165676
[01:20:56] (PS1) Dzahn: rolematcher - remove pmtpa [puppet] - https://gerrit.wikimedia.org/r/165677
[01:26:44] (PS2) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[01:26:46] (PS5) Chad: First of (hopefully many) es-tool commands [puppet] - https://gerrit.wikimedia.org/r/163945
[01:26:48] (PS3) Chad: Another es-tool function: restart a node the fast & easy way [puppet] - https://gerrit.wikimedia.org/r/164401
[01:26:50] (PS4) Chad: More elasticsearch tools [puppet] - https://gerrit.wikimedia.org/r/164270
[01:26:52] (PS1) Dzahn: redis - remove pmtpa monitoring group [puppet] - https://gerrit.wikimedia.org/r/165678
[01:34:20] PROBLEM - Disk space on analytics1035 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 149638 MB (3% inode=99%):
[02:17:53] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-09 02:17:53+00:00
[02:18:04] Logged the message, Master
[02:30:03] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-09 02:30:03+00:00
[02:30:11] Logged the message, Master
[03:26:59] (PS3) KartikMistry: WIP: apertium service for Beta [puppet] - https://gerrit.wikimedia.org/r/165485
[03:33:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 9 03:33:47 UTC 2014 (duration 33m 46s)
[03:33:53] Logged the message, Master
[03:55:02] (PS3) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[04:43:30] (CR) Krinkle: "Per https://bugzilla.wikimedia.org/show_bug.cgi?id=71761, this doesn't seem to stop it from existing instances. Looks like it might need a" [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[04:44:57] (CR) Krinkle: "I don't mind, but I don't see why we'd do it different here. Seems simple enough to just ensure absent and it'll just be removed automatic" [puppet] - https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: Zfilipin)
[04:58:20] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140
[04:59:52] <^d> ^ known, I'm doing stuff.
[05:00:16] :)
[05:00:39] <^d> Should recover in a minute or two. icinga happened to hit 1008 *right* as the service was bouncing :)
[05:09:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[05:19:07] (PS2) KartikMistry: WIP: Added initial Debian package [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/165528
[06:07:59] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:16:50] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[06:28:40] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:29:02] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:31] (PS1) Giuseppe Lavagetto: mediawiki: consolidate hosts in site.pp, convert mw1053 [puppet] - https://gerrit.wikimedia.org/r/165696
[06:36:29] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:41:40] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[06:46:40] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:10] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:53:51] <_joe_> ori: "pcc
[06:53:59] <_joe_> damn keyboard
[06:54:08] <_joe_> "pcc" is really _awesome_
[06:57:57] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: consolidate hosts in site.pp, convert mw1053 [puppet] - https://gerrit.wikimedia.org/r/165696 (owner: Giuseppe Lavagetto)
[07:00:47] _joe_: :)
[07:01:29] PROBLEM - puppet last run on mw1053 is CRITICAL: Timeout while attempting connection
[07:02:19] PROBLEM - Host mw1053 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:02:35] <_joe_> !log reinstalling mw1053
[07:02:42] Logged the message, Master
[07:03:21] <_joe_> the problem with writing procedures is you can't miss even one point
[07:05:34] <_joe_> (like "schedule downtime in icinga")
[07:07:29] RECOVERY - Host mw1053 is UP: PING OK - Packet loss = 0%, RTA = 5.30 ms
[07:08:39] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54251 bytes in 7.403 second response time
[07:11:40] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:12:35] <_joe_> hiera doesn't work as expected on the puppet compiler... damn, I really got to work on it.
[07:14:41] (PS1) Giuseppe Lavagetto: mediawiki: convert three more appservers to HAT. [puppet] - https://gerrit.wikimedia.org/r/165697
[07:15:09] <_joe_> !log reimaging mw102[3-5] to hhvm
[07:15:16] Logged the message, Master
[07:24:39] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: convert three more appservers to HAT. [puppet] - https://gerrit.wikimedia.org/r/165697 (owner: Giuseppe Lavagetto)
[07:31:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[07:43:19] PROBLEM - DPKG on mw1053 is CRITICAL: Connection refused by host
[07:43:24] PROBLEM - nutcracker process on mw1053 is CRITICAL: Connection refused by host
[07:43:38] PROBLEM - puppet last run on mw1053 is CRITICAL: Connection refused by host
[07:43:38] PROBLEM - Disk space on mw1053 is CRITICAL: Connection refused by host
[07:44:18] PROBLEM - RAID on mw1053 is CRITICAL: Connection refused by host
[07:44:39] PROBLEM - check configured eth on mw1053 is CRITICAL: Connection refused by host
[07:45:01] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: Connection refused by host
[07:45:02] PROBLEM - check if salt-minion is running on mw1053 is CRITICAL: Connection refused by host
[07:45:18] <_joe_> didn't I schedule downtime there?
[07:45:24] <_joe_> ok, whatever.
[07:45:38] PROBLEM - nutcracker port on mw1053 is CRITICAL: Connection refused by host
[07:48:09] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54230 bytes in 0.370 second response time
[07:56:09] PROBLEM - NTP on mw1053 is CRITICAL: NTP CRITICAL: Offset unknown
[07:59:28] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed
[07:59:29] RECOVERY - DPKG on mw1053 is OK: All packages OK
[07:59:29] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[07:59:38] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212
[07:59:38] RECOVERY - Disk space on mw1053 is OK: DISK OK
[07:59:49] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output
[08:00:10] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient
[08:00:19] RECOVERY - check if salt-minion is running on mw1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:01:37] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:02:46] PROBLEM - puppet last run on mw1024 is CRITICAL: Connection refused by host
[08:02:47] PROBLEM - Disk space on mw1024 is CRITICAL: Connection refused by host
[08:03:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[08:03:01] PROBLEM - RAID on mw1023 is CRITICAL: Connection refused by host
[08:03:06] PROBLEM - nutcracker port on mw1025 is CRITICAL: Connection refused by host
[08:03:16] PROBLEM - DPKG on mw1025 is CRITICAL: Connection refused by host
[08:03:16] PROBLEM - nutcracker process on mw1025 is CRITICAL: Connection refused by host
[08:03:27] PROBLEM - Disk space on mw1025 is CRITICAL: Connection refused by host
[08:03:27] PROBLEM - check configured eth on mw1023 is CRITICAL: Connection refused by host
[08:03:27] PROBLEM - puppet last run on mw1025 is CRITICAL: Connection refused by host
[08:03:39] PROBLEM - RAID on mw1024 is CRITICAL: Connection refused by host
[08:03:39] PROBLEM - check if dhclient is running on mw1023 is CRITICAL: Connection refused by host
[08:03:58] PROBLEM - check if salt-minion is running on mw1023 is CRITICAL: Connection refused by host
[08:04:08] <_joe_> oh I hate you icinga
[08:04:09] PROBLEM - check configured eth on mw1024 is CRITICAL: Connection refused by host
[08:04:19] PROBLEM - check if dhclient is running on mw1024 is CRITICAL: Connection refused by host
[08:04:19] PROBLEM - RAID on mw1025 is CRITICAL: Connection refused by host
[08:04:41] PROBLEM - check if salt-minion is running on mw1024 is CRITICAL: Connection refused by host
[08:04:42] RECOVERY - check configured eth on mw1023 is OK: NRPE: Unable to read output
[08:04:48] PROBLEM - check configured eth on mw1025 is CRITICAL: Connection refused by host
[08:04:48] RECOVERY - check if dhclient is running on mw1023 is OK: PROCS OK: 0 processes with command name dhclient
[08:04:49] RECOVERY - RAID on mw1024 is OK: OK: no RAID installed
[08:04:58] PROBLEM - check if dhclient is running on mw1025 is CRITICAL: Connection refused by host
[08:04:58] RECOVERY - check if salt-minion is running on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:04:59] RECOVERY - Disk space on mw1024 is OK: DISK OK
[08:05:09] PROBLEM - check if salt-minion is running on mw1025 is CRITICAL: Connection refused by host
[08:05:09] RECOVERY - check configured eth on mw1024 is OK: NRPE: Unable to read output
[08:05:19] RECOVERY - RAID on mw1023 is OK: OK: no RAID installed
[08:05:29] RECOVERY - nutcracker port on mw1025 is OK: TCP OK - 0.000 second response time on port 11212
[08:05:37] RECOVERY - check if dhclient is running on mw1024 is OK: PROCS OK: 0 processes with command name dhclient
[08:05:37] RECOVERY - RAID on mw1025 is OK: OK: no RAID installed
[08:05:39] RECOVERY - DPKG on mw1025 is OK: All packages OK
[08:05:39] RECOVERY - nutcracker process on mw1025 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:05:50] RECOVERY - check if salt-minion is running on mw1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:05:50] RECOVERY - Disk space on mw1025 is OK: DISK OK
[08:05:59] RECOVERY - check configured eth on mw1025 is OK: NRPE: Unable to read output
[08:05:59] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:06:17] RECOVERY - check if dhclient is running on mw1025 is OK: PROCS OK: 0 processes with command name dhclient
[08:06:18] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:06:27] RECOVERY - check if salt-minion is running on mw1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:06:56] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:08:00] <_joe_> for the record, this happens because the host gets removed from icinga when we clean up puppet facts, thus the downtime gets yanked
[08:10:07] RECOVERY - NTP on mw1053 is OK: NTP OK: Offset -0.001644015312 secs
[08:14:16] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:14:26] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:14:27] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:15:07] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:15:16] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:26] PROBLEM - NTP on mw1023 is CRITICAL: NTP CRITICAL: Offset unknown
[08:15:37] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:06] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66889 bytes in 0.291 second response time
[08:16:07] PROBLEM - NTP on mw1024 is CRITICAL: NTP CRITICAL: Offset unknown
[08:16:37] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66889 bytes in 0.301 second response time
[08:16:43] PROBLEM - NTP on mw1025 is CRITICAL: NTP CRITICAL: Offset unknown
[08:17:22] <_joe_> !log repooling mw102[3-5],mw1053 in the hhvm pool
[08:17:27] RECOVERY - NTP on mw1023 is OK: NTP OK: Offset -0.01114153862 secs
[08:17:27] Logged the message, Master
[08:18:08] RECOVERY - NTP on mw1024 is OK: NTP OK: Offset -0.02184319496 secs
[08:18:36] RECOVERY - NTP on mw1025 is OK: NTP OK: Offset -0.0204795599 secs
[08:21:05] (CR) Nemo bis: "At this stage in the setup, is it possible to check if the LanguageConverter class is loaded for that wiki?" [mediawiki-config] - https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) (owner: Reedy)
[08:45:20] (CR) Giuseppe Lavagetto: [C: -1] "Small correction, LGTM." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[08:46:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[08:59:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[09:14:35] (PS1) Glaisher: Add 'abusefilter-modify-restricted' to sysops at zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/165704 (https://bugzilla.wikimedia.org/71854)
[09:15:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[09:19:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[09:23:57] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:38:19] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:43:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[09:45:05] (CR) Filippo Giunchedi: First of (hopefully many) es-tool commands (3 comments) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[09:47:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[09:48:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[09:51:49] that was planned maintenance
[09:51:58] (CR) Filippo Giunchedi: [C: 1] "first time ever I hear about rolematcher, wikitech shows no hits too" [puppet] - https://gerrit.wikimedia.org/r/165677 (owner: Dzahn)
[09:59:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[10:04:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[10:07:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[10:11:41] ...and that's outside the maintenance window
[10:20:43] (PS7) Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - https://gerrit.wikimedia.org/r/164358
[10:22:43] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: consolidate apache configs [puppet] - https://gerrit.wikimedia.org/r/164358 (owner: Giuseppe Lavagetto)
[10:35:03] ah godog
[10:35:14] <_joe_> !log disabling puppet on most mw* hosts while testing apache changes
[10:35:20] Logged the message, Master
[10:35:25] apergos: hey!
[10:35:50] so while folk are touching apache (well afterwards), there's "rotate all the logs and only keep two weeks' worth"
[10:36:05] which would mean a daily apache graceful on all the apaches as the current patchset has it
[10:36:36] https://gerrit.wikimedia.org/r/#/c/130296/ when you're done with the current stuff, mind taking a look?
[10:36:58] <_joe_> apergos: yeah I am aware of that patch
[10:37:02] I'm just unsure about all of them gracefulling at the same time
[10:37:09] ah _joe_, good...
[10:37:12] <_joe_> (godog == Filippo)
[10:37:21] bah
[10:37:23] <_joe_> tu quoque, apergos :P
[10:37:34] I just remember that he has a nick that goes to some unrelated handle :-D
[10:37:37] <_joe_> Giuseppe == joseph => joe
[10:37:46] yeah yours makes sense
[10:38:24] haha indeed mine doesn't
[10:39:09] * apergos shuts up and waits for the config stuff to be tested and happy first
[10:39:47] apergos: patch still looks good to me though :)
[10:40:34] :-)
[10:41:07] what do you think about all the apaches gracefulling at once? I guess a scap does that (or does it any more?)
[10:43:14] <_joe_> it does not
[10:43:35] <_joe_> apache gracefulling is nice anyway, but why do we need that?
[10:44:34] well either you copytruncate or you do that, otherwise the clients will write to the wrong logs
[10:44:41] after rotation
[10:45:12] unless there's some other thing you had in mind
[10:45:46] <_joe_> I don't remember if there is an internal way to rotate logs in apache
[10:48:22] it looks like not so much
[10:48:36] there's the rotatelogs thing for use with piped logs, it's an external program
[10:49:48] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:51:07] (PS5) Filippo Giunchedi: swift-synctool: enable/disable/show sync [software] - https://gerrit.wikimedia.org/r/160428
[10:51:48] (CR) Filippo Giunchedi: swift-synctool: enable/disable/show sync (3 comments) [software] - https://gerrit.wikimedia.org/r/160428 (owner: Filippo Giunchedi)
[10:58:08] <_joe_> damn apache configs; they're full of subtleties
[11:00:46] * YuviPanda switches everything to nginx+fastcgi rather than apache+fastcgi
[11:04:27] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[11:08:34] (PS1) Giuseppe Lavagetto: mediawiki: restore redirect to https for donatewiki robots.txt [puppet] - https://gerrit.wikimedia.org/r/165706
[11:08:54] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: restore redirect to https for donatewiki robots.txt [puppet] - https://gerrit.wikimedia.org/r/165706 (owner: Giuseppe Lavagetto)
[11:11:39] !log disabled puppet in ms-fe/ms-be in eqiad/codfw to merge container-sync changes
[11:11:46] Logged the message, Master
[11:11:56] <_joe_> !log reenabled puppet on mw*
[11:12:03] Logged the message, Master
[11:12:17] (PS3) Filippo Giunchedi: swift: add container sync [puppet] - https://gerrit.wikimedia.org/r/160430
[11:13:16] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: add container sync [puppet] - https://gerrit.wikimedia.org/r/160430 (owner: Filippo Giunchedi)
[11:13:48] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet last ran 88061 seconds ago, expected 14400
[11:14:57] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[11:18:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[11:20:42] :P
[11:22:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[11:27:28] srsly?
[11:39:02] !log starting upgrade of elastic1009
[11:39:12] Logged the message, Master
[11:42:34] (PS1) Filippo Giunchedi: swift: move hiera params to the right place [puppet] - https://gerrit.wikimedia.org/r/165708
[11:43:23] _joe_: ^
[11:43:59] (CR) Giuseppe Lavagetto: [C: 1] swift: move hiera params to the right place [puppet] - https://gerrit.wikimedia.org/r/165708 (owner: Giuseppe Lavagetto)
[11:52:17] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: move hiera params to the right place [puppet] - https://gerrit.wikimedia.org/r/165708 (owner: Filippo Giunchedi)
[11:52:27] thanks!
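A note on the log rotation exchange at 10:43 to 10:48 above: the point that a daemon "will write to the wrong logs after rotation" unless it is gracefully restarted or the file is copy-truncated can be reproduced with a small sketch. This is an illustration only, not the contents of the patch under review; the file name `access.log`, the `.1` suffix, and all function names are hypothetical, and the long-lived writer is assumed to open its log in append mode on a POSIX system.

```python
import os
import shutil
import tempfile


def rotate_by_rename(path):
    """Move the live log aside; an already-open writer keeps its old handle."""
    rotated = path + ".1"  # hypothetical suffix, chosen for illustration
    os.rename(path, rotated)
    return rotated


def rotate_by_copytruncate(path):
    """Copy the live log, then empty the original file in place."""
    rotated = path + ".1"
    shutil.copyfile(path, rotated)
    os.truncate(path, 0)
    return rotated


def demo(rotate):
    """Return (rotated_contents, live_contents) after one rotation."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "access.log")
        writer = open(path, "a")  # stands in for a long-running daemon
        writer.write("before\n")
        writer.flush()
        rotated = rotate(path)
        writer.write("after\n")  # written without reopening the file
        writer.flush()
        writer.close()
        with open(rotated) as fh:
            old = fh.read()
        live = ""
        if os.path.exists(path):
            with open(path) as fh:
                live = fh.read()
        return old, live


if __name__ == "__main__":
    print("rename:", demo(rotate_by_rename))
    print("copytruncate:", demo(rotate_by_copytruncate))
```

With a plain rename the second write follows the old file handle into the rotated file, which is why the writer has to be told to reopen its logs (the graceful restart discussed above). With copy-then-truncate the writer needs no signal, at the cost of a short window between the copy and the truncate in which new lines can be lost.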
[11:57:19] !log xtrabackup db1016 to db2010
[11:57:26] Logged the message, Master
[11:59:46] !log converted some librenms tables to innodb on db1001 m1-master. should be a no-op
[11:59:51] Logged the message, Master
[12:01:55] !log reedy Purged l10n cache for 1.24wmf22
[12:02:01] Logged the message, Master
[12:02:53] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 338 seconds
[12:13:23] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[12:16:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[12:52:23] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[12:57:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[12:57:40] what the hell
[13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141009T1300).
[13:03:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[13:07:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[13:17:54] (CR) Springle: [C: 1] "DB bits are ready. See RT." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: BryanDavis)
[13:19:19] akosiaris: Argh crap I'm an idiot
[13:19:32] akosiaris: My site.pp patch installs role::citoid , but it needs to be role::citoid::production :S
[13:19:37] * RoanKattouw writes a patch
[13:20:03] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 337 seconds
[13:20:33] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 344 seconds
[13:20:35] But... it is? Huh?
[13:21:09] OK salt documentation time then
[13:21:15] salt ?
[13:21:22] trebuchet you mean ?
[13:21:25] For git-deploy
[13:21:27] Yeah trebuchet
[13:21:45] it should be enough to do a git-deploy start ; git-deploy sync
[13:21:53] on /srv/deployment/citoid/deploy on tin
[13:21:53] I'll try that now
[13:22:03] I just checked and /srv/deployment/citoid doesn't even exist on sca1001 yet
[13:22:07] I had expected puppet to create that already
[13:22:09] But I'll try
[13:22:36] I kind of did too. Not absolutely sure yet about how ori trebuchet package provider works
[13:23:03] and since this is a first in production, we are kind of guinea pigs :-)
[13:23:20] Well that worked beautifully
[13:23:24] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[13:23:34] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay -0 seconds
[13:23:36] (CR) Chad: First of (hopefully many) es-tool commands (5 comments) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[13:24:02] ori: Thank you !
[13:24:11] RoanKattouw: thanks as well :-)
[13:24:23] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:24:31] OK citoid is running on both sca1001 and sca1002 now
[13:24:33] RECOVERY - citoid on sca1001 is OK: HTTP OK: HTTP/1.1 200 OK - 745 bytes in 0.021 second response time
[13:24:43] I had to start it on 1001 but it had already started on 1002, maybe because puppet was just running there
[13:24:47] Sweet!
[13:24:56] Cool!
[13:25:00] so ... LVS time then
[13:25:05] RECOVERY - citoid on sca1002 is OK: HTTP OK: HTTP/1.1 200 OK - 745 bytes in 0.033 second response time
[13:25:06] Yes exactly
[13:25:11] That's not working yet
[13:25:13] * RoanKattouw looks at LVS logs
[13:25:15] ok, reviewing one last time and merging
[13:25:40] 10.2.2.19 is Destination Unreachable
[13:26:34] Oh right that change isn't merged yet
[13:26:38] * RoanKattouw glares at Gerrit search
[13:26:49] (PS2) Alexandros Kosiaris: Add LVS for citoid [puppet] - https://gerrit.wikimedia.org/r/164759 (owner: Catrope)
[13:28:41] (CR) Alexandros Kosiaris: [C: 2] Add LVS for citoid [puppet] - https://gerrit.wikimedia.org/r/164759 (owner: Catrope)
[13:36:33] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[13:37:58] RoanKattouw: I dare say that LVS is done and works fine as well
[13:37:58] (CR) Cscott: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/165329 (owner: Cscott)
[13:38:26] Yup, works for me
[13:38:33] Awesome! Thank you so much!
[13:39:36] (PS4) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[13:39:37] RoanKattouw: thanks as well! Seems like we got a new service
[13:39:38] (PS6) Chad: First of (hopefully many) es-tool commands [puppet] - https://gerrit.wikimedia.org/r/163945
[13:39:40] (PS4) Chad: Another es-tool function: restart a node the fast & easy way [puppet] - https://gerrit.wikimedia.org/r/164401
[13:39:42] (PS5) Chad: More elasticsearch tools [puppet] - https://gerrit.wikimedia.org/r/164270
[13:40:13] Indeed we did
[13:40:32] Now I just need to refactor the code that uses it and implement the new UI for it and and and :D
[13:40:46] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[13:41:13] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[13:41:15] ETOOMANYANDS
[13:41:34] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time
[13:45:19] (CR) Chad: Adding tools for banning/unbanning an ES node (1 comment) [puppet] - https://gerrit.wikimedia.org/r/164617 (owner: Chad)
[13:50:23] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[13:50:23] (CR) Filippo Giunchedi: First of (hopefully many) es-tool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[13:54:08] (CR) Filippo Giunchedi: First of (hopefully many) es-tool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[13:54:15] (CR) Filippo Giunchedi: First of (hopefully many) es-tool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[13:54:36] !log begin reimaging of mw1029
[13:54:41] Logged the message, Master
[13:55:23] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[13:56:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[14:02:02] !log updated pybal on palladium for citoid
[14:02:07] Logged the message, Master
[14:03:16] (PS1) coren: Reimaging mw1029 as appserver_hhvm [puppet] - https://gerrit.wikimedia.org/r/165719
[14:03:18] (PS5) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[14:03:20] (PS7) Chad: First of (hopefully many) es-tool commands [puppet] - https://gerrit.wikimedia.org/r/163945
[14:03:22] (PS5) Chad: Another es-tool function: restart a node the fast & easy way [puppet] - https://gerrit.wikimedia.org/r/164401
[14:03:24] (PS6) Chad: More elasticsearch tools [puppet] - https://gerrit.wikimedia.org/r/164270
[14:03:28] (CR) Chad: First of (hopefully many) es-tool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[14:03:41] (CR) Giuseppe Lavagetto: [C: 1] Reimaging mw1029 as appserver_hhvm [puppet] - https://gerrit.wikimedia.org/r/165719 (owner: coren)
[14:04:01] (CR) Chad: First of (hopefully many) es-tool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad)
[14:04:17] PROBLEM - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host
[14:04:41] (CR) coren: [C: 2] Reimaging mw1029 as appserver_hhvm [puppet] - https://gerrit.wikimedia.org/r/165719 (owner: coren)
[14:05:19] RoanKattouw: that you ? ^
[14:07:49] akosiaris: I haven't touched anything
[14:07:55] Let me look at the sca boxes
[14:08:36] Hmm I found some nice error logs here
[14:09:18] (PS1) Filippo Giunchedi: swift: fix hiera variables naming [puppet] - https://gerrit.wikimedia.org/r/165722
[14:09:27] HTTP to localhost is working fine on both sca1001 and 1002 though
[14:09:41] Because Zotero is what's failing and the welcome page (which is what the LVS health checks use) doesn't use Zotero
[14:09:48] Resolving citoid.svc.eqiad.wmflabs (citoid.svc.eqiad.wmflabs)... failed: Temporary failure in name resolution.
[14:09:51] Ahm...
[14:10:09] akosiaris: Did the Citoid DNS change get deployed during the codfw network cut maybe?
[14:10:50] Oh nm I'm an idiot, I need to learn how to type
[14:11:03] wmnet not wmflabs
[14:11:22] akosiaris: I don't know man, icinga-wm is on crack. I can hit citoid just fine
[14:11:39] $ wget http://citoid.svc.eqiad.wmnet:1970/ -O- is super fast
[14:12:42] RoanKattouw: yeah, I am noticing the same ....
[14:13:08] maybe a check issue...
[14:15:04] (PS2) Manybubbles: Elasticsearch Drop number of concurrent merges [puppet] - https://gerrit.wikimedia.org/r/163188
[14:15:51] I checked the pybal log and those checks seem to be happy
[14:16:30] <_joe_> akosiaris: IP issue?
[14:16:37] <_joe_> no if the checks work [14:17:30] found it [14:18:05] /usr/lib/nagios/plugins/check_http -H citoid.svc.eqiad.wmnet -p 1970 -I 10.2.2.19 -u "" [14:18:06] HTTP CRITICAL - No data received from host [14:18:06] /usr/lib/nagios/plugins/check_http -H citoid.svc.eqiad.wmnet -p 1970 -I 10.2.2.19 -u "/" [14:18:06] HTTP OK: HTTP/1.1 200 OK - 745 bytes in 0.002 second response time |time=0.001569s;;;0.000000 size=745B;;;0 [14:18:29] so the check needs another !/ at the end [14:18:38] (03CR) 10Faidon Liambotis: [C: 031] Remove all IMAP configuration and Puppet manifests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164940 (owner: 10Mark Bergsma) [14:18:42] (03PS3) 10Faidon Liambotis: Remove all IMAP configuration and Puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/164940 (owner: 10Mark Bergsma) [14:18:57] (03CR) 10Faidon Liambotis: [C: 032] Remove all IMAP configuration and Puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/164940 (owner: 10Mark Bergsma) [14:19:03] mathoid's url displatching is probably different which is why it is working [14:19:11] RoanKattouw: _joe_ ^ [14:19:14] displatching? 
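The two check_http runs above differ only in the -u argument: an empty string gets "No data received from host", while "/" gets a 200. A minimal Python sketch of one client-side guard follows. The helper names normalize_path and build_request_line are hypothetical and appear nowhere in the log, and the idea that an empty -u value ends up as an empty request target is an assumption about the cause, not something the log confirms.

```python
def normalize_path(path: str) -> str:
    """Return a usable HTTP request target (hypothetical helper, not part of check_http)."""
    # An empty path would produce a request line with no target
    # ("GET  HTTP/1.1"), which a server may decline to answer.
    if not path:
        return "/"
    # Origin-form request targets start with a slash.
    if not path.startswith("/"):
        return "/" + path
    return path


def build_request_line(path: str, method: str = "GET") -> str:
    """Build the first line of an HTTP/1.1 request for the given path."""
    return f"{method} {normalize_path(path)} HTTP/1.1"


if __name__ == "__main__":
    for raw in ("", "/", "api"):
        print(repr(raw), "->", build_request_line(raw))
```

This only illustrates the fallback plan mentioned in the channel (adjusting the check so it always sends a path); it does not address why citoid itself rejects the empty form.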
[14:19:21] dispatching [14:20:14] could be displatching as well [14:20:18] http://www.urbandictionary.com/define.php?term=displatch [14:20:36] it is node.js after all [14:20:40] <_joe_> translated: "we need a rewrite rule" [14:20:54] WTF [14:21:04] citoid breaks on "" but works for "/" , that's odd [14:21:27] The web server part of citoid is <50 lines so let me see if I can quickly fix that [14:21:34] -u, --url=PATH [14:21:35] URL to GET or POST (default: /) [14:21:51] (03PS2) 10Filippo Giunchedi: swift: fix hiera variables naming [puppet] - 10https://gerrit.wikimedia.org/r/165722 [14:21:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix hiera variables naming [puppet] - 10https://gerrit.wikimedia.org/r/165722 (owner: 10Filippo Giunchedi) [14:31:10] ACKNOWLEDGEMENT - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host alexandros kosiaris citoid seems to not like being queried without a URL at the request. Investigated at citoid level by Roan, fallback plan is to adjust the check [14:31:21] (03CR) 10Filippo Giunchedi: First of (hopefully many) es-tool commands (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:31:39] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:32:02] hey an ack page, neat [14:34:13] ahahaha [14:34:22] that's a first [14:34:27] useful though [14:34:59] btw... 2 minutes delay between you, bblack getting the page and me getting the page [14:38:02] it says 14:31 but it was a lie, it came later [14:38:25] prolly about a minute after you a kosiaris [14:40:33] akosiaris: Hmm I'm out of my league here trying to figure out this problem, so I'd rather bounce it to Gabriel. I'll see if I can change the check in the meantime [14:40:49] <_joe_> I didn't get the page... [14:40:56] _joe_: great! 
nothing to worry about [14:40:57] seriously now, this is something to investigate [14:43:16] (03PS1) 10Catrope: Work around Citoid bug in health check [puppet] - 10https://gerrit.wikimedia.org/r/165731 [14:43:16] akosiaris: ---^^ [14:43:47] PROBLEM - RAID on mw1029 is CRITICAL: Connection refused by host [14:44:07] PROBLEM - check configured eth on mw1029 is CRITICAL: Connection refused by host [14:44:18] PROBLEM - check if dhclient is running on mw1029 is CRITICAL: Connection refused by host [14:44:28] PROBLEM - check if salt-minion is running on mw1029 is CRITICAL: Connection refused by host [14:45:02] PROBLEM - nutcracker port on mw1029 is CRITICAL: Connection refused by host [14:45:08] PROBLEM - nutcracker process on mw1029 is CRITICAL: Connection refused by host [14:45:09] PROBLEM - DPKG on mw1029 is CRITICAL: Connection refused by host [14:45:18] PROBLEM - Disk space on mw1029 is CRITICAL: Connection refused by host [14:45:18] PROBLEM - puppet last run on mw1029 is CRITICAL: Connection refused by host [14:45:54] I am wondering.... [14:45:57] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 6706 seconds [14:46:18] is that the server being re-imaged? 
[14:47:28] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [14:48:06] RECOVERY - check if salt-minion is running on mw1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:48:27] RECOVERY - RAID on mw1029 is OK: OK: no RAID installed [14:48:28] RECOVERY - nutcracker port on mw1029 is OK: TCP OK - 0.000 second response time on port 11212 [14:48:46] RECOVERY - DPKG on mw1029 is OK: All packages OK [14:48:46] RECOVERY - nutcracker process on mw1029 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:48:47] RECOVERY - check configured eth on mw1029 is OK: NRPE: Unable to read output [14:48:56] RECOVERY - Disk space on mw1029 is OK: DISK OK [14:48:56] RECOVERY - check if dhclient is running on mw1029 is OK: PROCS OK: 0 processes with command name dhclient [14:49:18] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [14:49:50] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds [14:51:19] manybubbles, marktraceur, ^d: So who wants to SWAT this morning? [14:51:32] prtksxna: Ping for SWAT in about 8.5 minutes [14:51:34] either is fine - I can do it if no noe wants it [14:51:47] manybubbles: I don't want it [14:51:55] I'll do it [14:52:13] Oh crap, SWAT time [14:52:14] anomie: Can I SWAT a VE patch? [14:52:39] RoanKattouw: manybubbles is going to do the SWAT today. There should be time to add it to the list. 
[14:52:39] RoanKattouw: sure [14:52:48] OK [14:52:52] * RoanKattouw starts cherry-picking [14:54:18] (03PS1) 10Alexandros Kosiaris: Default to UNKNOWN when NRPE checks timeout [puppet] - 10https://gerrit.wikimedia.org/r/165732 [14:54:37] (03PS1) 10Filippo Giunchedi: swift: fix container-sync template [puppet] - 10https://gerrit.wikimedia.org/r/165733 [14:55:10] (03PS2) 10Filippo Giunchedi: swift: fix container-sync template [puppet] - 10https://gerrit.wikimedia.org/r/165733 [14:55:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix container-sync template [puppet] - 10https://gerrit.wikimedia.org/r/165733 (owner: 10Filippo Giunchedi) [14:55:55] I could have, but take it away manybubbles [15:00:05] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141009T1500). [15:00:21] prtksxna: its time for to deploy your SWAT [15:00:27] ready to verify that it worked? [15:00:45] marktraceur and anomie: those less files are just recompiled on change, right? [15:01:14] manybubbles: I have no idea how that works. [15:01:26] manybubbles: they should be, yeah. [15:01:29] manybubbles: I added one just in the nick of time there [15:01:39] anomie: thanks. will ready while I wait for prtksxna or RoanKattouw to be ready [15:01:39] ( https://gerrit.wikimedia.org/r/165738 ) [15:01:45] And I declare myself ready :) [15:02:20] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [15:03:08] ^ that's me testing [15:04:16] manybubbles: https://gerrit.wikimedia.org/r/#/c/163188 good to merge I'm assuming? [15:04:29] godog: fine by me! 
[15:05:12] RoanKattouw: I'll get that merged and deployed then [15:05:34] (03PS3) 10Filippo Giunchedi: Elasticsearch Drop number of concurrent merges [puppet] - 10https://gerrit.wikimedia.org/r/163188 (owner: 10Manybubbles) [15:05:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Elasticsearch Drop number of concurrent merges [puppet] - 10https://gerrit.wikimedia.org/r/163188 (owner: 10Manybubbles) [15:05:57] manybubbles: ack, it's done! [15:06:03] (03CR) 10Alexandros Kosiaris: swift-synctool: enable/disable/show sync (031 comment) [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [15:06:07] thanks! [15:07:01] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: puppet fail [15:07:23] yeah yeah icinga-wm [15:07:50] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [15:09:57] prtksxna: around? I'd like to do your SWAT deploy in a few minutes [15:13:45] oh crap. I rebased against 1.2_4_wmf2 accidentally. now its sad [15:13:46] ^^ [15:13:48] anomie: ^^ [15:14:28] manybubbles: unhappy as in? [15:14:35] git now hates me [15:14:45] fatal: bad config file line 70 in .gitmodules [15:15:32] I wonder if I can rm php-1.25wmf2 and rebuild it. that is what I do on my laptop when I screw make git this mad [15:15:39] but I can't if there are security patches [15:15:45] I suppose [15:15:49] manybubbles: Wait, on the cluster? [15:15:49] manybubbles: hang on a second [15:16:00] OK I'll let anomie deal with this [15:16:03] RoanKattouw: yeah - I was doing the rebase step and mistyped [15:16:05] manybubbles: Is it fixed-ish now? [15:16:18] anomie: looks much better [15:16:20] what did you do? 
[15:16:24] manybubbles: You don't really need to rebase any more, we made 'git pull' an alias for 'git pull --rebase' [15:16:38] manybubbles: I saw you were on a detached head, so just git checkout -f wmf/1.25wmf2 [15:16:52] anomie: oh, well, thats probably right [15:16:57] Wait [15:16:57] well, lets just move on then [15:16:59] oh now [15:17:00] no [15:17:06] Hold on let me check something [15:17:17] If there were security patches before, they have to be restored [15:17:21] I forget whether we had any [15:17:43] (and once I find out whether we do, I can't tell this channel anyway) [15:18:16] nice [15:18:31] RoanKattouw: Aren't the security patches locally committed to the wmf branch when we have them? [15:18:42] Maybe? [15:18:58] I'm checking reflog just in case [15:19:01] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:21:03] Reminder: if you have interesting or disruptive stuff that will be deployed next week, please add them to https://meta.wikimedia.org/wiki/Tech/News/2014/42 [15:21:12] manybubbles: OK so you're OK to proceed now on the security patch front, but you're still in a rebase conflict state, so I'm gonna clean that up for you [15:21:29] RoanKattouw: oh, I was just doing it I think [15:21:32] but you can take over [15:21:34] Or a merge conflict state or *something*. git status is crazy [15:21:39] yeah [15:21:58] * anomie was observing that too [15:22:28] Ugh, it's because the local wmf/1.25wmf2 branch is hosed [15:22:29] when I do this locally I always reclone but that isn't an option for us [15:22:32] It's pointing to 24 [15:22:45] RoanKattouw: that is my fault - that is how I rebased it [15:22:52] mistype one fucking number [15:23:45] OK here we go [15:23:47] You're all set now [15:23:59] Your next step is to run git submodule update --recursive extensions/VisualEditor [15:24:03] (don't forget --recursive) [15:24:05] RoanKattouw: Still not on a branch? 
[15:24:06] manybubbles: Ooh, I did that last week, that's fun [15:24:27] anomie: Ugh, fixing [15:24:30] RoanKattouw: k. can you document how you fix that shit? [15:25:22] manybubbles: I don't really recall how I did it and I'd rather not spend hours writing it up [15:25:30] ah [15:25:42] Instead, you can just run 'git pull'. It's aliased to 'git pull --rebase' in that clone [15:25:58] Deep magic is how I fix stuff like that, I never remember exactly how either [15:26:14] RoanKattouw: I thought we always wanted to git log HEAD..origin/XXX to check what we're getting? [15:26:55] manybubbles: You can do that after git fetch and before git pull [15:26:56] !log upgraded wikitech-static to 1.25wmf2 [15:26:56] ok to sync then? [15:26:56] manybubbles: That's about what I do: git fetch && git log HEAD..origin/XXX, then if that looks good git pull [15:26:56] pull will fetch again, but meh [15:26:56] Logged the message, Master [15:28:31] !log manybubbles Synchronized php-1.25wmf2/extensions/VisualEditor/: SWAT deploy VE cherry-pick (duration: 00m 06s) [15:28:37] Logged the message, Master [15:28:44] RoanKattouw: thanks for fixing it. Here is sync^^ [15:28:59] prtksxna: you are next! ready for swat? [15:29:06] (03CR) 10Alexandros Kosiaris: Added initial Debian packaging (032 comments) [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/165475 (owner: 10KartikMistry) [15:29:32] Thanks manybubbles [15:29:57] thank you! [15:30:53] prtksxna: I'm giving you another ten minutes to ping me as ready for SWAT or I'm booting your change. [15:30:56] <^d> I was afk, sorry guys. [15:31:38] !log done reimaging of mw1029. 
Now hhvm_appserver [15:31:43] Logged the message, Master [15:31:55] !log begin reimaging of mw1028 [15:31:59] Logged the message, Master [15:32:41] PROBLEM - puppet last run on elastic1011 is CRITICAL: CRITICAL: Puppet has 2 failures [15:34:08] !log restarted Zuul [15:34:13] Logged the message, Master [15:34:35] (03CR) 10Alexandros Kosiaris: "Same questions as in https://gerrit.wikimedia.org/r/#/c/165475/" [debs/contenttranslation/apertium-es-pt] - 10https://gerrit.wikimedia.org/r/165473 (owner: 10KartikMistry) [15:37:02] (03PS1) 10coren: Switch mw1028 to appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/165746 [15:40:08] (03CR) 10coren: [C: 032] Switch mw1028 to appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/165746 (owner: 10coren) [15:42:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Typo, plus all the questions from https://gerrit.wikimedia.org/r/#/c/165475/" (031 comment) [debs/contenttranslation/apertium-en-es] - 10https://gerrit.wikimedia.org/r/165471 (owner: 10KartikMistry) [15:44:51] (03CR) 10Filippo Giunchedi: [C: 031] Default to UNKNOWN when NRPE checks timeout [puppet] - 10https://gerrit.wikimedia.org/r/165732 (owner: 10Alexandros Kosiaris) [15:46:46] (03CR) 10Alexandros Kosiaris: "Same questions as for https://gerrit.wikimedia.org/r/#/c/165475/" [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/163578 (owner: 10KartikMistry) [15:47:35] (03CR) 10Alexandros Kosiaris: [C: 032] Add .gitreview file [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/164057 (owner: 10KartikMistry) [15:47:42] (03CR) 10Alexandros Kosiaris: [V: 032] Add .gitreview file [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/164057 (owner: 10KartikMistry) [15:50:01] RECOVERY - puppet last run on elastic1011 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:00:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Added initial Debian packaging (033 comments) 
[debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/163577 (owner: 10KartikMistry) [16:06:48] (03PS2) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-en-es] - 10https://gerrit.wikimedia.org/r/165471 [16:08:32] (03PS2) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/165475 [16:10:39] (03CR) 10Filippo Giunchedi: swift-synctool: enable/disable/show sync (031 comment) [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [16:18:51] yo akosiaris, yt? [16:22:30] oblivian is doing a graceful restart of all apaches [16:22:40] PROBLEM - Disk space on mw1028 is CRITICAL: Connection refused by host [16:22:40] PROBLEM - puppet last run on mw1028 is CRITICAL: Connection refused by host [16:22:41] (03PS2) 10Filippo Giunchedi: base: add checks for 127.0.1.1 in /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/157795 [16:22:47] (03PS1) 10Jgreen: remove role::mail::sender from role::labs::instance, it's already included via standard [puppet] - 10https://gerrit.wikimedia.org/r/165751 [16:22:51] oblivian is doing a graceful restart of all apaches [16:23:13] !log oblivian gracefulled all apaches [16:23:21] Logged the message, Master [16:23:29] PROBLEM - RAID on mw1028 is CRITICAL: Connection refused by host [16:23:42] (03CR) 10KartikMistry: Added initial Debian packaging (033 comments) [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/163577 (owner: 10KartikMistry) [16:23:49] PROBLEM - check configured eth on mw1028 is CRITICAL: Connection refused by host [16:24:10] PROBLEM - check if dhclient is running on mw1028 is CRITICAL: Connection refused by host [16:24:30] PROBLEM - check if salt-minion is running on mw1028 is CRITICAL: Connection refused by host [16:24:43] PROBLEM - nutcracker port on mw1028 is CRITICAL: Connection refused by host [16:24:50] PROBLEM - nutcracker process on mw1028 is CRITICAL: 
Connection refused by host [16:24:50] PROBLEM - DPKG on mw1028 is CRITICAL: Connection refused by host [16:27:17] <^d> YuviPanda: [2014-10-09 16:19:18,635][WARN ][org.elasticsearch.service.graphite.GraphiteReporter] Error writing to Graphite: Connection timed out [16:27:21] <^d> Ok, that's something ^ [16:27:38] (03PS2) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-es-pt] - 10https://gerrit.wikimedia.org/r/165473 [16:27:59] ^d: In what I consider facepalm, I think the problem might be that we've statsd sitting before graphite here, and so technically we need an ES statsd plugin... [16:28:12] <^d> ... [16:28:20] I know. I'm an idiot. [16:28:39] <^d> There is one, luckily. [16:28:42] <^d> https://github.com/swoop-inc/elasticsearch-statsd-plugin [16:28:45] oh? [16:29:19] <^d> Basically identical structure. [16:29:31] yeah [16:30:05] ^d: can we try that? [16:30:10] <^d> Ok, I gotta dump out for a few hours. I'll have a look at deploying this one instead. [16:30:12] there's only minor differences between statsd and graphite... [16:30:16] <^d> *dip out [16:30:22] ^d: cool, thanks! [16:30:25] (03PS4) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/163578 [16:30:27] ^d: and sorry about the wildish goose chase. [16:35:03] (03CR) 10coren: [C: 031] "It is good to remove redundancy and removing redundancy is good." 
[puppet] - 10https://gerrit.wikimedia.org/r/165751 (owner: 10Jgreen) [16:35:49] PROBLEM - NTP on mw1028 is CRITICAL: NTP CRITICAL: Offset unknown [16:35:49] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:36:01] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:36:26] (03CR) 10Jgreen: [C: 032 V: 031] remove role::mail::sender from role::labs::instance, it's already included via standard [puppet] - 10https://gerrit.wikimedia.org/r/165751 (owner: 10Jgreen) [16:36:40] Why is mw1028 whining? It was in maintenance. [16:36:49] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet last ran 19598 seconds ago, expected 14400 [16:36:50] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Puppet last ran 19630 seconds ago, expected 14400 [16:36:50] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet last ran 19622 seconds ago, expected 14400 [16:36:59] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet last ran 20292 seconds ago, expected 14400 [16:37:00] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet last ran 19887 seconds ago, expected 14400 [16:37:00] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet last ran 19543 seconds ago, expected 14400 [16:37:00] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet last ran 19951 seconds ago, expected 14400 [16:37:01] Oooooh. Stupid icinga. Flexible maintenance. 
[16:37:31] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet last ran 20059 seconds ago, expected 14400 [16:37:32] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet last ran 19657 seconds ago, expected 14400 [16:37:32] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet last ran 20027 seconds ago, expected 14400 [16:37:40] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Puppet last ran 20658 seconds ago, expected 14400 [16:37:40] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: Puppet last ran 20056 seconds ago, expected 14400 [16:37:50] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet last ran 20372 seconds ago, expected 14400 [16:37:50] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet last ran 19815 seconds ago, expected 14400 [16:38:37] mh missed it by ~5000s :( [16:39:01] !log re-enable puppet on ms-fe/ms-be in codfw [16:39:06] Logged the message, Master [16:40:09] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:41:09] RECOVERY - NTP on mw1028 is OK: NTP OK: Offset -0.01753103733 secs [16:41:24] <_joe_> Coren: because wmf-reimage wipes the host from puppet [16:41:40] <_joe_> so when you reinstall it it's a fresh host [16:41:41] wmf-reimage? [16:41:45] <_joe_> without downtime [16:42:00] _joe_: Oh! Duh! It's obvious in retrospect. 
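The "puppet last run" alerts in this stretch all follow one pattern: an age in seconds compared against the quoted limit of 14400 (four hours). A rough Python sketch of that comparison is below. The function names are hypothetical, the real Icinga plugin is not shown in the log, and whether it uses a strict or non-strict comparison is an assumption.

```python
MAX_AGE_SECONDS = 14400  # limit quoted in the alerts ("expected 14400"), i.e. 4 hours


def classify_puppet_run(age_seconds: int, max_age: int = MAX_AGE_SECONDS) -> str:
    """Hypothetical model of the freshness check; strict '>' is an assumption."""
    return "CRITICAL" if age_seconds > max_age else "OK"


def seconds_overdue(age_seconds: int, max_age: int = MAX_AGE_SECONDS) -> int:
    """How far past the limit a run is, or 0 if it is still fresh."""
    return max(0, age_seconds - max_age)


if __name__ == "__main__":
    # Ages taken from the alerts above: ms-be2012, ms-fe2002, and a post-recovery run.
    for age in (19598, 20658, 41):
        print(age, classify_puppet_run(age), seconds_overdue(age))
```

With the ages from the log, 19598 seconds is 5198 seconds past the limit, which matches the "missed it by ~5000s" remark.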
[16:42:02] <_joe_> paravoid: ah, a small script that does the cleaning on puppet/salt [16:42:20] <_joe_> and then polls them both to ask you to sign the new keys [16:42:21] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:42:39] nice [16:42:46] <_joe_> Coren: I realized that this morning [16:43:14] just add the idrac/ipmitool commands over there as well [16:43:21] and you're pretty much done ;) [16:43:28] <_joe_> paravoid: mmmh almost, yes [16:43:34] plus a loop to run puppet on boot [16:43:46] <_joe_> the procedure is pretty boring right now, I [16:43:52] !log re-enable puppet on ms-fe/ms-be in eqiad [16:43:53] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:43:55] <_joe_> m trying to make it simpler and simpler [16:43:59] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:43:59] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet last ran 20716 seconds ago, expected 14400 [16:43:59] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet last ran 20053 seconds ago, expected 14400 [16:43:59] Logged the message, Master [16:44:05] [sorry alarm storm inbound] [16:44:09] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Puppet last ran 20576 seconds ago, expected 14400 [16:44:10] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet last ran 20376 seconds ago, expected 14400 [16:44:10] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet last ran 20956 seconds ago, expected 14400 [16:44:10] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet last ran 20947 seconds ago, expected 14400 [16:44:10] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet last ran 20586 seconds ago, expected 14400 [16:44:10] PROBLEM - puppet last run on ms-be1003 is 
CRITICAL: CRITICAL: Puppet last ran 20465 seconds ago, expected 14400 [16:44:19] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet last ran 21081 seconds ago, expected 14400 [16:44:29] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Puppet last ran 20827 seconds ago, expected 14400 [16:44:33] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet last ran 20871 seconds ago, expected 14400 [16:44:33] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet last ran 20019 seconds ago, expected 14400 [16:44:39] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet last ran 20985 seconds ago, expected 14400 [16:44:40] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Puppet last ran 20538 seconds ago, expected 14400 [16:44:49] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet last ran 20124 seconds ago, expected 14400 [16:44:49] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet last ran 20083 seconds ago, expected 14400 [16:44:50] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet last ran 20601 seconds ago, expected 14400 [16:44:50] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Puppet last ran 20515 seconds ago, expected 14400 [16:44:59] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:45:31] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:46:10] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:48:11] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:49:09] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:49:49] RECOVERY - puppet last run on 
ms-be1012 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:50:00] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:50:09] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:50:19] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:50:19] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:50:40] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:50:50] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:51:22] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:51:39] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:53:19] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:53:30] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:54:49] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:55:29] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:55:44] manybubbles: o/ [16:56:13] manybubbles: Did I mess up time zones? :( [16:56:29] prtksxna: musta been! SWAT was two hours ago [16:56:29] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:56:43] argghh! 
:( [16:56:44] 8am SF time/11am my time [16:56:46] sorry! [16:56:49] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:56:53] you can reschedule for the next one? [16:57:34] manybubbles: I read it wrong. I wanted to sleep and was up to get this done now :| [16:57:48] manybubbles: Yeah, I'll do that, but it won't go on all the Wikipedias now¬ [16:57:50] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:57:54] RECOVERY - Disk space on mw1028 is OK: DISK OK [16:57:54] RECOVERY - nutcracker process on mw1028 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:57:54] RECOVERY - DPKG on mw1028 is OK: All packages OK [16:57:59] RECOVERY - check configured eth on mw1028 is OK: NRPE: Unable to read output [16:58:12] RECOVERY - check if dhclient is running on mw1028 is OK: PROCS OK: 0 processes with command name dhclient [16:58:19] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:58:32] RECOVERY - RAID on mw1028 is OK: OK: no RAID installed [16:58:50] <_joe_> !log gracefully restarted again api apaches to recover 500s [16:58:50] RECOVERY - nutcracker port on mw1028 is OK: TCP OK - 0.000 second response time on port 11212 [16:58:51] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures [16:58:55] Logged the message, Master [16:59:32] manybubbles: If I get it done in the next one by when does it show up on the Wikipedias? [17:00:21] prtksxna: those are in wmf2 so wikipedias will get it immediately [17:00:39] manybubbles: Oh right. 
I'll do that then [17:00:47] wmf3 (going onto test wikis in an hour) won't have it unless you backport it for wmf3 or its already included there [17:00:49] manybubbles: Sorry if I wasted your time, making you wait [17:01:05] its cool - I mostly just pinged you and got back to work [17:01:40] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:01:45] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:02:09] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:02:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [17:03:02] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:03:11] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:04:00] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:04:40] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:07:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [17:11:13] greg-g, ping. [17:11:13] subbu: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [17:12:17] aha .. greg-g i assume you've seen my full-of-content ping on #mediawiki-parsoid :) will wait for a pong on it. [17:12:27] or rather that you'll see it. 
[17:14:23] subbu: hey, sorry, in meetings [17:14:28] :/ [17:14:34] subbu: I assume it's ok :) [17:16:29] greg-g, ok .. is there a good time when we should do it? 1pm PST good? [17:27:32] subbu: that should be fine, yeah [17:28:08] k, thanks. [17:37:29] bd808: holler when yo uhave some free time, I"m eating dinner right now but after tht ready to try some salt install and play on deployment-prep [17:38:48] apergos: "free" is a relative term. :) I should be available after ~19:00Z [17:40:02] so 10 pm here... that works for me [17:40:35] I'll check in around then, thanks! [17:44:40] RECOVERY - Disk space on analytics1035 is OK: DISK OK [17:55:49] !log done reimaging of mw1028. Now hhvm_appserver [17:55:56] Logged the message, Master [17:56:28] !log begin reimaging of mw1027 [17:56:33] Logged the message, Master [17:57:09] PROBLEM - HHVM rendering on mw1028 is CRITICAL: Connection refused [17:57:34] Stupid pybal faster than I intended. [17:57:41] (Already depooled ^^) [17:59:30] PROBLEM - Apache HTTP on mw1028 is CRITICAL: Connection refused [18:00:05] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141009T1800). [18:25:39] Anyone around for some puppet advice? [18:26:16] (please) :) [18:26:30] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/beta.pp#L101 [18:26:53] can I just move that code into another class, and include the other class in multiple places in beta? 
[18:27:50] Reedy: Yes [18:28:13] Puppet 3 doesn't have dynamic scooping so it's rather sane [18:28:24] (03PS1) 10Reedy: Make beta jobrunners use beta nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/165770 [18:28:32] So that's bug fix number 1 [18:28:39] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [18:29:14] Just noticed that jobrunners in beta are using production memcached and spamming tonnes of errors [18:29:49] DAMN IT [18:30:26] Reedy: they shouldn't even be able to connect [18:30:31] or is that what you're seeing [18:30:35] if so, lulz [18:30:43] they're trying to [18:30:44] And failing [18:30:46] miserably [18:31:06] (03PS2) 10Reedy: Make beta jobrunners use beta nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/165770 [18:31:51] sigh [18:31:53] (03PS3) 10Reedy: Make beta jobrunners use beta nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/165770 [18:32:11] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:59] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165773 [18:33:01] (03PS1) 10Reedy: testwiki to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165774 [18:33:03] (03PS1) 10Reedy: Wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165775 [18:33:05] (03PS1) 10Reedy: group0 to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165776 [18:33:24] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165773 (owner: 10Reedy) [18:33:31] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165773 (owner: 10Reedy) [18:33:44] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165774 (owner: 10Reedy) [18:33:51] (03Merged) 10jenkins-bot: testwiki to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165774 (owner: 10Reedy) [18:34:01] 
!log reedy Started scap: testwiki to 1.25wmf3 and build l10n cache [18:34:10] Logged the message, Master [18:38:45] (03CR) 10Reedy: [C: 031] "Cherry picked onto beta" [puppet] - 10https://gerrit.wikimedia.org/r/165770 (owner: 10Reedy) [18:42:34] !log reedy scap failed: TypeError bufsize must be an integer (duration: 08m 33s) [18:42:41] bd808: ^^ lol [18:42:42] Logged the message, Master [18:42:48] eek [18:43:00] bah. what's that from. [18:43:07] * bd808 goes to look at logs [18:43:19] bd808: http://p.defau.lt/?NoGn1xPjNmVe_T1t5b6yrw [18:44:06] ugh. Ori's fix for the bug Tim hit [18:45:21] (03PS1) 10Reedy: Setting $wgMemCachedServers = array(); [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165778 [18:45:32] (03CR) 10Reedy: [C: 032] Setting $wgMemCachedServers = array(); [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165778 (owner: 10Reedy) [18:45:41] (03Merged) 10jenkins-bot: Setting $wgMemCachedServers = array(); [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165778 (owner: 10Reedy) [18:46:08] See if that does anything for beta... [18:48:22] (03PS1) 10Ori.livneh: add `keyholder` module for managing a shared ssh-agent [puppet] - 10https://gerrit.wikimedia.org/r/165779 [18:48:26] Reedy: I see the bug. missing parens [18:50:30] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:51:19] Reedy: I think I fixed it. Give it a shot [18:51:23] Thanks [18:51:29] thanks [18:51:39] !log cherry-picked I3ae9edab2505c37945fe66863721913a6d33223c to scap [18:51:45] Logged the message, Master [18:51:47] !log reedy Started scap: testwiki to 1.25wmf3 and build l10n cache (take 2) [18:51:52] Logged the message, Master [18:56:33] (03CR) 10Reedy: "deployment-videoscaler01 also uses production nutcracker. 
Probably should move I should move include ::role::beta::nutcracker into role::b" [puppet] - 10https://gerrit.wikimedia.org/r/165770 (owner: 10Reedy) [18:59:17] akosiaris, paravoid: is there any timeline for upgrading the parsoid boxes to trusty? (just curious ... no pressure, I'm sure you're busy) [18:59:23] (03CR) 10BryanDavis: "Pretty fancy. What kind of audit logging should this do?" [puppet] - 10https://gerrit.wikimedia.org/r/165779 (owner: 10Ori.livneh) [19:00:50] (03PS4) 10Reedy: Make beta jobrunners use beta nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/165770 [19:01:19] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:01:36] 18:58:16 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1028 returned [127]: bash: /srv/deployment/scap/scap/bin/sync-common: No such file or directory [19:02:02] Coren: Have you finished with mw1028? [19:02:04] mw1028 is missing scap parts? [19:02:14] Coren reimaged that one AFAIR [19:02:49] https://gerrit.wikimedia.org/r/165746 [19:03:03] Saw in the SAL [19:03:04] hence asking :) [19:03:26] Reedy: trebuchet hasn't run there. Maybe needs salt key signed? [19:03:46] There's not /srv/deployment directory [19:12:50] <_joe_> Reedy: not yet [19:12:58] <_joe_> Reedy: I can finish it now if needed [19:13:18] _joe_: mw1028? [19:13:26] <_joe_> yes [19:13:32] I'm guessing it's not currently pooled? 
[19:14:09] RECOVERY - check if salt-minion is running on mw1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:14:12] It's not urgent [19:14:46] ie it doesn't need doing now [19:16:01] PROBLEM - check if salt-minion is running on mw1040 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:16:09] PROBLEM - check if salt-minion is running on mw1055 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:16:12] PROBLEM - check if salt-minion is running on tungsten is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:16:16] PROBLEM - check if salt-minion is running on mw1042 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:16:27] <_joe_> wat? [19:16:29] PROBLEM - check if salt-minion is running on mw1047 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:16:47] hey-lo! [19:16:57] can anyone here process access-requests ? [19:17:09] RECOVERY - check if salt-minion is running on mw1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:17:09] RECOVERY - check if salt-minion is running on mw1055 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:17:10] RECOVERY - check if salt-minion is running on mw1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:17:18] <_joe_> cscott: whoever's on duty [19:17:21] i'd like to make sure that arlolra can deploy ocg before i go on vacation next week, but his shell access request has been in limbo for a week. [19:17:26] so who's on duty? [19:17:26] <_joe_> look @topic [19:17:29] RECOVERY - check if salt-minion is running on mw1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:17:36] andrewbogott: you're it! 
[19:17:51] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:18:02] heh I guess the salt-minion check shouldn't fail just because a salt command is currently running [19:18:05] cscott: I am! Do you have a ticket # by chance? [19:18:17] (probably also the dpkg check shouldn't fail just because an apt command is currently being run) [19:18:32] andrewbogott: rt 8505 i think? [19:18:50] RECOVERY - DPKG on tungsten is OK: All packages OK [19:19:11] RECOVERY - check if salt-minion is running on tungsten is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:19:26] <_joe_> Reedy: FYI, it's running scap now [19:19:27] _joe_: A slightly more important thing would be how we do hiera for labs... https://github.com/wikimedia/operations-puppet/blob/9ad61aa3c94169e4c5d376371766b2e6983bb46b/modules/puppetmaster/files/labs.hiera.yaml#L12 [19:19:50] [20:11:24] So labs/deployment-prep.yaml I think [19:19:59] <_joe_> yes [19:20:24] <_joe_> Reedy: maybe tomorrow? I started working at 8 am :) [19:20:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [19:20:55] <_joe_> Reedy: I should be here around 14Z tomorrow if we want to work a little on this [19:21:07] I'm not gonna be around a lot of tomorrow [19:21:48] !log reedy Finished scap: testwiki to 1.25wmf3 and build l10n cache (take 2) (duration: 30m 00s) [19:21:49] I noticed a lot of labs was using production nutcracker config :( [19:21:54] Logged the message, Master [19:22:00] uh [19:22:02] s/labs/beta/ [19:23:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [19:23:49] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
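(Editor's note on the salt-minion alerts above: the point made at 19:18:02 is that a second, short-lived salt-minion process appears while a salt command runs, so a check that demands exactly one process flaps. A minimal sketch of a range-based check follows. The function name, the thresholds and the message format are assumptions for illustration only, not the actual monitoring plugin or its configuration.)

```python
# Hypothetical sketch of a process-count check; this is NOT the real
# Icinga/NRPE plugin, only an illustration of a range-based threshold.
OK, CRITICAL = 0, 2  # conventional Nagios-style exit codes


def check_proc_count(count, low=1, high=2):
    """Return (status, message) for a count of matching processes.

    low and high are assumed thresholds: the check goes CRITICAL only
    when the count falls outside the inclusive range [low, high], so a
    transient second process does not raise an alert.
    """
    if low <= count <= high:
        return OK, "PROCS OK: %d process(es)" % count
    return CRITICAL, "PROCS CRITICAL: %d process(es)" % count


if __name__ == "__main__":
    # Zero processes (daemon down) and three or more (runaway) alert;
    # one or two do not.
    for n in (0, 1, 2, 3):
        print(n, check_proc_count(n))
```

(With these assumed bounds, the two-process state reported at 19:16 for mw1040, mw1055, tungsten, mw1042 and mw1047 would stay OK.)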
[19:24:04] (03PS2) 10Reedy: Wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165775 [19:24:17] (03PS1) 10Dzahn: salt_minion monitoring - only CRIT if > 2 [puppet] - 10https://gerrit.wikimedia.org/r/165840 [19:24:22] bblack: ^ [19:26:34] (03CR) 10Reedy: [C: 032] Wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165775 (owner: 10Reedy) [19:26:43] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165775 (owner: 10Reedy) [19:27:20] (03CR) 10Jforrester: [C: 031] "Due now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:27:27] cscott: can you explain about the request for 'deployment-prep' for arlo? That should be something that you or another beta admin can give them. [19:27:39] Ah, so it says in the request. nevermind [19:27:39] yes, i didn't know that at the time [19:27:40] Reedy: Ping. https://gerrit.wikimedia.org/r/#/c/157477/ is scheduled for this deploy window; sorry it was still C-1'ed. [19:27:56] andrewbogott: i think i've already given him deployment-prep [19:28:04] James_F: That's fine, still going through this deploy, and attempting to fix beta at the same time etc [19:28:13] <_joe_> mutante: thanks [19:28:13] Reedy: My sympathies. [19:28:23] James_F: yes, i'm preparing to deploy the jjb config change now [19:28:41] (03CR) 10Dzahn: [C: 032] salt_minion monitoring - only CRIT if > 2 [puppet] - 10https://gerrit.wikimedia.org/r/165840 (owner: 10Dzahn) [19:28:44] cscott: Cool. Different channel for that conversation, though. 
:-) [19:28:54] (03PS1) 10Andrew Bogott: Provide access for Arlo Breault: parsoid-admin and ocg-render-admin [puppet] - 10https://gerrit.wikimedia.org/r/165847 [19:30:59] (03CR) 10Cscott: [C: 031] Provide access for Arlo Breault: parsoid-admin and ocg-render-admin [puppet] - 10https://gerrit.wikimedia.org/r/165847 (owner: 10Andrew Bogott) [19:31:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.25wmf2 [19:31:14] Logged the message, Master [19:31:43] andrewbogott: i don't have +2 rights on puppet, so you'll need to find another reviewer. [19:31:56] (03CR) 10Arlolra: [C: 031] Provide access for Arlo Breault: parsoid-admin and ocg-render-admin [puppet] - 10https://gerrit.wikimedia.org/r/165847 (owner: 10Andrew Bogott) [19:32:28] (03CR) 10Andrew Bogott: [C: 032] Provide access for Arlo Breault: parsoid-admin and ocg-render-admin [puppet] - 10https://gerrit.wikimedia.org/r/165847 (owner: 10Andrew Bogott) [19:32:47] (03PS2) 10Reedy: group0 to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165776 [19:32:53] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165776 (owner: 10Reedy) [19:33:07] (03Merged) 10jenkins-bot: group0 to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165776 (owner: 10Reedy) [19:34:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf3 [19:34:11] Logged the message, Master [19:34:56] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:34:58] James_F: ^^ [19:35:09] Bleh. [19:35:52] Reedy: https://gerrit.wikimedia.org/r/164884 easy peasy [19:36:04] What on Earth was done to need that? [19:36:17] (03CR) 10Reedy: "This should probably be fixed by using a heira file for nutcracker for labs. 
Then the config from here can presumably be removed completel" [puppet] - 10https://gerrit.wikimedia.org/r/165770 (owner: 10Reedy) [19:36:32] (03CR) 10Jforrester: "PS2 is a rebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:36:33] James_F: Not sure. A week or 2 ago gerrit wouldn't rebase anything for some stupid reason [19:36:35] (03PS2) 10Jforrester: Enable TemplateData GUI on remaining big Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) [19:36:39] Then it started working again [19:36:53] Reedy: Maybe that's it. git rebase origin/master && git review just worked. [19:36:57] yeah [19:37:03] We almost need a rebase bot for that [19:37:08] andrewbogott: thanks [19:37:12] for these trivial rebases that jgit fails on [19:37:25] (03CR) 10Dzahn: "please link access changes to a ticket. was actually reviewing" [puppet] - 10https://gerrit.wikimedia.org/r/165847 (owner: 10Andrew Bogott) [19:37:40] arlolra: it'll take 30 mins or so for the change to spread. Please let me know if things are not working in an hour. [19:37:51] (03PS3) 10Reedy: Enable TemplateData GUI on remaining big Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:37:54] Reedy: Or maybe just wait for Phabricator to come along and solve everything? :-) [19:37:55] (03CR) 10Reedy: [C: 032] Enable TemplateData GUI on remaining big Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:38:02] mutante: yeah, I noticed that I didn't add the ticket # a second after I merged. [19:38:08] The ticket, at least, links to the change. [19:38:12] * James_F grins. 
[19:38:15] (03Merged) 10jenkins-bot: Enable TemplateData GUI on remaining big Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157477 (https://bugzilla.wikimedia.org/70217) (owner: 10Jforrester) [19:38:28] brb, going to find something to drink [19:38:57] (03PS2) 10Reedy: Remove unused $wmgMediaViewerBeta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164884 (owner: 10Hoo man) [19:39:03] (03CR) 10Reedy: [C: 032] Remove unused $wmgMediaViewerBeta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164884 (owner: 10Hoo man) [19:39:14] (03Merged) 10jenkins-bot: Remove unused $wmgMediaViewerBeta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164884 (owner: 10Hoo man) [19:40:53] andrewbogott: i see the ticket now. gotcha, also confirmed by gwicke,thx [19:41:49] (03CR) 10Jforrester: [C: 04-1] "Waiting for wmf4." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (owner: 10Jforrester) [19:42:14] ok [19:42:20] !log upgrading elastic1014 [19:42:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:42:20] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:42:24] James_F: You could've just used 'large' => true ;) [19:42:26] Logged the message, Master [19:42:30] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [19:42:45] Reedy: Eh, but the follow-up just sets default => true instead. :-) [19:42:57] Reedy: (Follow-up due in a couple of weeks.) [19:42:59] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.085 second response time [19:43:08] hmm. has gerrit been updated lately or something? 
[19:43:29] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [19:43:29] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66959 bytes in 0.300 second response time [19:44:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [19:44:45] (03PS2) 10Reedy: Add 'abusefilter-modify-restricted' to sysops at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165704 (https://bugzilla.wikimedia.org/71854) (owner: 10Glaisher) [19:44:49] (03CR) 10Reedy: [C: 032] Add 'abusefilter-modify-restricted' to sysops at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165704 (https://bugzilla.wikimedia.org/71854) (owner: 10Glaisher) [19:44:57] (03Merged) 10jenkins-bot: Add 'abusefilter-modify-restricted' to sysops at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165704 (https://bugzilla.wikimedia.org/71854) (owner: 10Glaisher) [19:44:59] (03PS1) 10Andrew Bogott: Change wikitech backup crons to use new, proper dirs. 
[puppet] - 10https://gerrit.wikimedia.org/r/165859 [19:45:10] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [19:45:19] (03PS3) 10Reedy: Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) (owner: 10Calak) [19:45:23] (03CR) 10Reedy: [C: 032] Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) (owner: 10Calak) [19:45:30] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [19:45:32] (03Merged) 10jenkins-bot: Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) (owner: 10Calak) [19:45:54] (03CR) 10Dzahn: [C: 032] rolematcher - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165677 (owner: 10Dzahn) [19:45:58] (03CR) 10Reedy: [C: 04-1] "Need rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 (owner: 10Awight) [19:46:14] (03PS2) 10Reedy: Remove unused log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165404 (owner: 10MaxSem) [19:46:18] (03CR) 10Reedy: [C: 032] Remove unused log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165404 (owner: 10MaxSem) [19:46:20] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [19:46:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[19:46:30] (03Merged) 10jenkins-bot: Remove unused log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165404 (owner: 10MaxSem) [19:47:27] (03PS2) 10Reedy: Prevent search engines from indexing user pages and all talk pages on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164766 (https://bugzilla.wikimedia.org/71663) (owner: 10Calak) [19:47:30] (03CR) 10Reedy: [C: 032] Prevent search engines from indexing user pages and all talk pages on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164766 (https://bugzilla.wikimedia.org/71663) (owner: 10Calak) [19:47:42] (03Merged) 10jenkins-bot: Prevent search engines from indexing user pages and all talk pages on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164766 (https://bugzilla.wikimedia.org/71663) (owner: 10Calak) [19:48:19] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 6 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:48:21] (03CR) 10Andrew Bogott: [C: 032] Change wikitech backup crons to use new, proper dirs. [puppet] - 10https://gerrit.wikimedia.org/r/165859 (owner: 10Andrew Bogott) [19:49:09] (03CR) 10Dzahn: [C: 032] "not used, checked on neon" [puppet] - 10https://gerrit.wikimedia.org/r/165678 (owner: 10Dzahn) [19:49:13] (03CR) 10Reedy: "I note we seemed to have something similar on beta... I just did https://gerrit.wikimedia.org/r/165778 instead... I'm not sure which is be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161005 (owner: 10BryanDavis) [19:51:19] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:51:35] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [19:51:40] Logged the message, Master [19:52:07] !change 165676 | cscott [19:52:18] (that bot trigger was actually nice) [19:52:32] it used to get the gerrit link and ping [19:53:38] mutante: probably doesn't work on this channel. 
try on #-dev [19:54:31] mutante: Indeed! [19:55:12] MatmaRex: that works, thanks [19:55:22] at some point it was here i think [19:56:22] (03CR) 10Manybubbles: [C: 031] "Burn it with fire." [puppet] - 10https://gerrit.wikimedia.org/r/165672 (owner: 10Dzahn) [19:56:55] (03CR) 10Cscott: [C: 031] "LGTM, i don't have +2 rights in puppet though." [puppet] - 10https://gerrit.wikimedia.org/r/165676 (owner: 10Dzahn) [19:57:17] wow, that's really effective [19:57:20] thanks [19:57:36] (03CR) 10Chad: [C: 031] elasticsearch - delete pmtpa remnants [puppet] - 10https://gerrit.wikimedia.org/r/165672 (owner: 10Dzahn) [19:57:48] (03CR) 10Dzahn: [C: 032] elasticsearch - delete pmtpa remnants [puppet] - 10https://gerrit.wikimedia.org/r/165672 (owner: 10Dzahn) [19:58:30] (03PS3) 10Dzahn: remove pdf servers,role::pdf and misc pdf class [puppet] - 10https://gerrit.wikimedia.org/r/165676 [19:59:02] (03CR) 10Dzahn: [C: 032] remove pdf servers,role::pdf and misc pdf class [puppet] - 10https://gerrit.wikimedia.org/r/165676 (owner: 10Dzahn) [19:59:42] (03PS1) 10Ori.livneh: add auditd module; add auditd rules for keyholder [puppet] - 10https://gerrit.wikimedia.org/r/165862 [19:59:48] mutante: :D you could get one of the bot's admins to share the bot's brain from #-dev with here, it's already shared with #mediawiki and maybe some others [19:59:50] I trust: petan|w.*wikimedia/Petrb (2admin), .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@mediawiki/Catrope (2admin), .*@wikimedia/RobH (2admin), .*@wikimedia/Ryan-lane (2admin), petan!.*@wikimedia/Petrb (2admin), .*@wikimedia/Krinkle (2admin), [19:59:50] @trusted [20:00:14] (03CR) 10Ori.livneh: "@bd808: See follow-up patch, https://gerrit.wikimedia.org/r/#/c/165862/" [puppet] - 10https://gerrit.wikimedia.org/r/165779 (owner: 10Ori.livneh) [20:00:15] but i suppose there might have been a reason it wasn't done [20:00:35] yea, it's always controversial which bot should be where [20:00:40] either works for me [20:01:15] <^d> mutante: Would 
you mind having a look at https://gerrit.wikimedia.org/r/#/c/165602/? [20:01:18] !botbrain [20:01:32] (hmph, that doesn't work here either. nevermind.) [20:01:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:01:50] ^d: oh, i saw that yesterday, yea. and i see +1 from chris, so .. sure [20:02:02] <^d> thx [20:02:33] (03CR) 10Dzahn: [C: 032] Gerrit: explicitly whitelist image formats we want to display [puppet] - 10https://gerrit.wikimedia.org/r/165602 (https://bugzilla.wikimedia.org/70892) (owner: 10Chad) [20:02:54] (03PS1) 10Jhobs: Add 437-05 to unified baselining [puppet] - 10https://gerrit.wikimedia.org/r/165863 [20:03:02] !log rebooting samarium [20:03:08] Logged the message, Master [20:05:10] (03PS2) 10Jforrester: Enable TemplateData GUI for all wikis; move config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) [20:05:35] (03CR) 10Jforrester: "Scheduled for 6 November." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [20:05:59] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 21 MB (4% inode=99%): [20:06:08] FUUUUUUUUUUUUUUUU [20:06:25] hashar: ^^ [20:06:44] that's the job thing [20:06:46] bblack: mind taking a look at https://gerrit.wikimedia.org/r/#/c/165863/ and then approving and merging and deploying if it looks good? jhobs is the new addition to partners / zero team. [20:07:06] Reedy: there's a bug for it [20:07:12] (03CR) 10Ori.livneh: First of (hopefully many) es-tool commands (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [20:07:18] Reedy: cleaning it [20:07:20] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 19 MB (3% inode=99%): [20:07:42] bd808: Right. 
But I was presuming it was a sign jenkins was going to crap out :) [20:09:19] RECOVERY - Disk space on lanthanum is OK: DISK OK [20:09:33] bug is logged and there is a way to nicely garbage collect them [20:10:00] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [20:10:29] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail [20:17:30] deploying new version of parsoid ... [20:29:50] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:33:42] !log deployed parsoid version 644071d2 [20:33:49] Logged the message, Master [20:35:32] (03PS1) 10Cscott: Give cscott the ability to deploy zuul changes. [puppet] - 10https://gerrit.wikimedia.org/r/165867 [20:36:04] hashar: ^ although i suspect an RT ticket # and an email to access-requests is probably the way I *should* be doing this [20:36:59] cscott: what kind of access do you need? [20:37:08] ah [20:37:21] mutante: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Deploy_configuration :) [20:37:39] (03CR) 10Hashar: [C: 031] "I trust C. Scott :)" [puppet] - 10https://gerrit.wikimedia.org/r/165867 (owner: 10Cscott) [20:37:42] cscott: +1ed :° [20:38:13] mutante: hashar and i have been fixing up parsoid's jenkins jobs over on #-qa [20:38:46] sounds very reasonable to me, just that quick mail to access-requests please for the trail [20:39:09] mutante: thanks a lot for tampa work [20:40:11] matanya: :) getting closer [20:55:17] (03PS7) 10Ori.livneh: base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 [20:55:20] ^ _joe_ [20:57:38] (03CR) 10Krinkle: [C: 031] "Familiarise with https://www.mediawiki.org/wiki/CI/JJB and https://www.mediawiki.org/wiki/CI/Z if not already. Assume the docs are perfect" [puppet] - 10https://gerrit.wikimedia.org/r/165867 (owner: 10Cscott) [21:00:30] (03CR) 10Cscott: "Krinkle -- I already have JJB access. But thanks for the warnings re zuul." 
[puppet] - 10https://gerrit.wikimedia.org/r/165867 (owner: 10Cscott) [21:13:58] cscott: ldap/wmf can push to Jenkins, but that's not by design and doesn't mean everyone should actually access it :) [21:14:15] so JJB access is kind of granted implicitly/socially. most ppl just don't know how. [21:14:16] :) [21:14:51] always push to gerrit first, and merge right after pushing to jenkins from your local machine. [21:16:22] Krinkle: yup. [21:16:28] cscott: +2 everything Timo says :] [21:20:39] (03PS8) 10Giuseppe Lavagetto: base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [21:20:51] (03CR) 10Giuseppe Lavagetto: [C: 032] base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [21:21:24] _joe_: thanks! [21:21:54] <_joe_> ori: I'll merge that [21:22:04] (03PS2) 10BBlack: Add 437-05 to unified baselining [puppet] - 10https://gerrit.wikimedia.org/r/165863 (owner: 10Jhobs) [21:22:14] (03CR) 10BBlack: [C: 032 V: 032] Add 437-05 to unified baselining [puppet] - 10https://gerrit.wikimedia.org/r/165863 (owner: 10Jhobs) [21:22:55] andrewbogott: ssh bast1001.wikimedia.org [21:22:55] Permission denied (publickey). [21:23:05] :( [21:23:53] What username arlolra? [21:24:15] arlolra: try again while I watch the logs? [21:24:25] whoami [21:24:25] arlolra [21:24:50] You don't have a home directory on bast1001 [21:24:51] andrewbogott: 4 attempts just now [21:24:52] ok, says Invalid user arlolra, I'll investigate [21:25:27] Reedy, andrewbogott: thanks [21:30:30] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:32:30] RECOVERY - DPKG on labmon1001 is OK: All packages OK [21:33:10] arlolra: it doesn't look like the groups you requested membership in include access to virt1001. Did you mean to request access to 'deployment' as well? [21:33:13] cscott, any idea? 
[21:33:34] bblack: thanks for looking that vcl change [21:33:55] andrewbogott: it's possible the docs i wrote about the set of groups required to deploy ocg are incomplete. [21:34:19] virt1001 is a bastion, right? (i should read the scrollback) [21:34:59] chasemp: does the admin module support group-in-group? It would sort of make sense to have many groups include an implicit 'bastion' membership rather than having the bastions enumerate all groups that need access. [21:35:02] dr0ptp4kt: np :) [21:35:20] cscott: what about virt1001? [21:35:20] <_joe_> .win 22 [21:35:27] andrewbogott: it doesn't and I've thought about it, but never did it [21:35:27] <_joe_> grrr [21:35:32] chasemp: ok [21:35:36] mainly because bastion was the only example I could think of [21:35:41] and it was too abstracted for one case [21:35:44] andrewbogott: you need access to tin and deployment-bastion in order to deploy ocg. [21:35:45] it would confuse more than help I thought [21:36:14] cscott: ok, so arlolra is trying to connect to bast1001 because…? [21:36:37] Is tin public or do you need to go via bast1001 to get there? [21:36:41] * andrewbogott should really know this [21:37:00] well, anyway, the only group with access to tin is 'deployment' [21:37:05] andrewbogott: you need bastion [21:37:08] andrewbogott: i think i go via bastion [21:37:26] <^d> tin is not public. [21:37:26] andrewbogott: that's what they're telling me to do [21:37:41] andrewbogott: there is also a "bastion-only" group if needed [21:38:00] maybe deployment-prep should have just been deployment in that email [21:38:07] so i'm checking puppet -- i'm a member of parsoid-admin, deployment, ocg-render-admins, (and pdf-qa-users, which i have no idea what it does) [21:38:15] so i guess 'deployment' is the odd dog out there. 
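(Editor's note on the access discussion above: each host admits an explicit list of groups, and there is no group-in-group support, so a user must belong to an admitted group on every hop, bastion included. A minimal sketch of that lookup follows. The host-to-group table is hypothetical and only mirrors names used in the chat; the real mapping lives in the puppet admin module and may differ.)

```python
# Hypothetical data: which groups each host admits. The names mirror
# the chat, but this table is illustrative, not the real puppet config.
HOST_GROUPS = {
    "bast1001": {"deployment"},
    "tin": {"deployment"},
}


def can_login(user_groups, host, host_groups=HOST_GROUPS):
    """True if the user shares at least one group with the host's list."""
    return bool(set(user_groups) & host_groups.get(host, set()))


def hosts_blocked(user_groups, path, host_groups=HOST_GROUPS):
    """Return the hosts along an ssh path that the user cannot log in to."""
    return [h for h in path if not can_login(user_groups, h, host_groups)]


if __name__ == "__main__":
    arlo = {"parsoid-admin", "ocg-render-admin"}
    # Without a group admitted on the bastion, the first hop already fails.
    print(hosts_blocked(arlo, ["bast1001", "tin"]))
    print(hosts_blocked(arlo | {"deployment"}, ["bast1001", "tin"]))
```

(Under these assumptions, adding the service groups to the host lists for tin and the bastion would unblock the path just as well as adding the user to 'deployment'.)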
[21:39:25] that's what used to be "mortals" in the past [21:39:36] andrewbogott: you added me to parsoid-admin in 0ef350163984322e3d99b09ac1cecc7d855eb6d9 but i was already a member of deployment at that time [21:39:39] the group who deploys mediawiki [21:39:46] (03PS1) 10Andrew Bogott: Add arlolra to deployment as well. [puppet] - 10https://gerrit.wikimedia.org/r/165892 [21:39:50] <^d> We should've kept mortals. [21:40:17] mere mortals [21:40:32] are you going to deploy mw? [21:40:50] I hope not [21:40:59] i was added to deployment in RT #7542 [21:41:05] andrewbogott: you could add him to bastion only [21:41:10] ocg and parsoid [21:41:23] (03CR) 10Hoo man: [C: 04-1] "If they are not supposed to be a deployer, don't make them one. We have a "bastiononly" group these days." [puppet] - 10https://gerrit.wikimedia.org/r/165892 (owner: 10Andrew Bogott) [21:41:38] (03CR) 10Dzahn: "if all you want is to add bastion to the existing groups, use "bastiononly" group. deployment not really needed" [puppet] - 10https://gerrit.wikimedia.org/r/165892 (owner: 10Andrew Bogott) [21:41:40] Currently the only group with access to tin is 'deployment' [21:41:42] just tell me what group you choose so i can add it to the "you must be members of these groups to deploy" documentation for OCG and Parsoid [21:41:58] andrewbogott: Why is tin needed? [21:42:08] tin is where we stage git-deploy for ocg and parsoid [21:42:16] > Tin has many uses. It takes a high polish and is used to coat other metals to prevent corrosion, such as in tin cans which are made of tin-coated steel. Alloys of tin are important, such as soft solder, pewter, bronze and phosphor bronze. [21:42:16] that sounds like deploying to me [21:42:41] YuviPanda: Correct answer, yet totally useless :D [21:42:48] https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes and https://wikitech.wikimedia.org/wiki/OCG#Deploying_changes [21:42:52] hoo: unlike Tin! :) [21:43:09] (03CR) 10Hoo man: " andrewbogott: Why is tin needed?" 
[puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:43:28] ha [21:44:33] (CR) Dzahn: "i guess we make no difference so far between _what_ is deployed and this is deployment, just another kind because it's not mediawiki.. shr" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:45:49] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:46:01] (CR) Dzahn: [C: 1] "...unless we want to introduce parsoid-deployers and add it to tin, which would also seem good" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:46:09] andrewbogott: Creating a custom group would be rather easy these days, if you care enough [21:47:49] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: puppet fail [21:48:17] there's already parsoid-admin and ocg-admin [21:48:24] if you add those to tin we'd be good i think. [21:49:46] (CR) Cscott: "There's already parsoid-admin and ocg-render-admins -- let's use those instead of inventing a new parsoid-deployers group (if we wanted to" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:50:44] (CR) Dzahn: "excellent point. yea, let's add one of those to tin, where they are needed" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:51:07] and bastion [21:51:08] (PS1) Andrew Bogott: Add parsoid-admin ocg-render-admin to tin and bast1001. [puppet] - https://gerrit.wikimedia.org/r/165897 [21:51:10] (CR) Cscott: "Oh, but note that the /srv/deployment/ocg and /srv/deployment/parsoid directories are both setgid wikidev. So we'd need to create new uni" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:51:12] and whatever else we're forgetting [21:51:32] andrewbogott: we need new unix groups to own the code on tin as well. [21:51:57] Groups other than wikidev?
[21:52:07] Nobody needs groups other than wikidev :D [21:53:34] (CR) Dzahn: [C: 1] Add parsoid-admin ocg-render-admin to tin and bast1001. [puppet] - https://gerrit.wikimedia.org/r/165897 (owner: Andrew Bogott) [21:57:42] hoo: is wikidev connected to the deployment group? or are we all wikidev? [21:58:05] (CR) Cscott: "Just for reference, here are the install directions for Parsoid: https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [21:58:21] cscott: Everyone is wikidev [21:58:26] ok then [21:58:27] in the past it was the only gid we had [21:58:28] :P [21:58:33] That's the point of that joke [21:59:04] (CR) Cscott: "It looks like https://gerrit.wikimedia.org/r/165897 is the preferred solution now." [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [22:00:10] chasemp: have any reservations about https://gerrit.wikimedia.org/r/#/c/165897/ ? [22:00:56] andrewbogott: only that ocg-render-admins is plural. [22:01:54] cscott: andrewbogott . oh. heh, i just saw this [22:01:57] (CR) Cscott: [C: -1] Add parsoid-admin ocg-render-admin to tin and bast1001. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/165897 (owner: Andrew Bogott) [22:01:58] "cscott is requesting access to tmh* hosts. such as thm1001." [22:02:11] i should have moved that to access requests queue earlier.. does now [22:02:34] probably because duplicate tickets were merged [22:02:37] (PS2) Andrew Bogott: Add parsoid-admin ocg-render-admin to tin and bast1001. [puppet] - https://gerrit.wikimedia.org/r/165897 [22:02:46] mutante: oh, right -- last time i actually tried to deploy a *mediawiki* config (since i am a member of deployment, doncha know) it failed on thm* [22:03:47] (PS2) Dzahn: Give cscott the ability to deploy zuul changes. [puppet] - https://gerrit.wikimedia.org/r/165867 (owner: Cscott) [22:04:33] cscott: while at it.. linked that too.
so you have 2 tickets, one for each [22:04:41] (CR) Cscott: [C: 1] Add parsoid-admin ocg-render-admin to tin and bast1001. [puppet] - https://gerrit.wikimedia.org/r/165897 (owner: Andrew Bogott) [22:04:49] PROBLEM - check if salt-minion is running on analytics1003 is CRITICAL: NRPE: Command check_check_salt_minion not defined [22:08:41] (CR) Dzahn: [C: 1] Give cscott the ability to deploy zuul changes. [puppet] - https://gerrit.wikimedia.org/r/165867 (owner: Cscott) [22:09:44] andrewbogott, not sure on giving them perms from policy angle, but syntax is good [22:10:21] (CR) Dzahn: [C: 1] "yea, this seems nicer than Change-Id: Ie9fd2d3f5358" [puppet] - https://gerrit.wikimedia.org/r/165897 (owner: Andrew Bogott) [22:10:48] https://gerrit.wikimedia.org/r/#/c/165903/ [22:10:56] sorry, wrong channel [22:11:17] (CR) Dzahn: "agree, Change-Id: Ib519628c3f33e seems nicer" [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [22:12:30] (Abandoned) Andrew Bogott: Add arlolra to deployment as well. [puppet] - https://gerrit.wikimedia.org/r/165892 (owner: Andrew Bogott) [22:12:46] (CR) Andrew Bogott: [C: 2] Add parsoid-admin ocg-render-admin to tin and bast1001. [puppet] - https://gerrit.wikimedia.org/r/165897 (owner: Andrew Bogott) [22:12:57] whoo [22:13:22] of course i just noticed that the commit summary uses ocg-render-admin instead of ocg-render-admins. oh well. [22:16:00] arlolra: try now [22:16:41] :) [22:17:18] andrewbogott, et al.: thanks [22:19:40] does the puppet change require some time to propagate? [22:20:16] Yep... every server affected needs at least one puppet run [22:20:27] might take up to ~30 mins [22:20:50] that, unless we speed it up by manually running it [22:22:27] I ran it on bast1001 and tin [22:23:47] (PS1) Ottomata: Grant analytics shell account access to Marcel Ruiz Forns [puppet] - https://gerrit.wikimedia.org/r/165909 [22:27:33] verified I could access both.
thanks [22:27:43] (PS2) Ottomata: Grant analytics shell account access to Marcel Ruiz Forns [puppet] - https://gerrit.wikimedia.org/r/165909 [22:29:59] (PS1) Christopher Johnson (WMDE): fix typo in static yaml phab priority settings file [puppet] - https://gerrit.wikimedia.org/r/165911 [22:45:08] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141009T2300). [23:00:20] * hoo waves to whoever is going to do SWAT [23:00:37] * hoo has https://gerrit.wikimedia.org/r/165913 https://gerrit.wikimedia.org/r/165914 [23:00:38] * MaxSem will do [23:00:42] :) [23:01:05] hoo, I patches from wiki only XD [23:01:11] What [23:01:13] ? [23:01:13] *I get patches [23:01:19] (CR) Ori.livneh: "needs rebase" [puppet] - https://gerrit.wikimedia.org/r/147487 (owner: Reedy) [23:01:20] It's in the Wiki [23:01:24] but not the core bumps [23:01:30] Reedy: wanna amend that? ^^ [23:01:37] I can +2 myself if you don't want to [23:03:07] prtksxna, yt? [23:03:15] MaxSem: o/ [23:03:18] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [23:10:15] andrewbogott or legoktm, yt? [23:10:22] I am [23:10:23] I am [23:10:26] cool [23:11:58] any elasticsearch/logstash magicians available? [23:12:11] jgage: you? [23:12:20] cajoel: you summoned? [23:12:25] awesome [23:12:35] MaxSem: am here :) [23:12:52] I have weeks of old crufty logs that I'd like to import in to my elasticseach (my house -- oit-- not production) [23:13:02] (PS1) Dzahn: gerrit - add 'phab' short link to phabricator [puppet] - https://gerrit.wikimedia.org/r/165923 [23:13:15] even though I /think/ I'm using a fitler to READ the actual timestamps, the data keeps showing up at the time I import it [23:13:27] heh.
[23:13:33] instead of re-aligned to the actual dates the events happened [23:13:39] any guidelines on ingesting old logs? [23:13:52] syslog for the most part [23:13:56] paste of your filter config? [23:14:06] (PS1) Ori.livneh: mediawiki: remove cruft from apache2.conf [puppet] - https://gerrit.wikimedia.org/r/165924 [23:14:28] hrm [23:14:34] (CR) Dzahn: [C: 1] "so that we can start linking from gerrit to phab tasks, k?" [puppet] - https://gerrit.wikimedia.org/r/165923 (owner: Dzahn) [23:14:46] sent in a pm [23:16:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [23:17:50] (CR) Ori.livneh: [C: 1] "I thought the way you made the pattern case-insensitive was a little funny, but sure enough, the docs say: "To match case insensitive stri" [puppet] - https://gerrit.wikimedia.org/r/165923 (owner: Dzahn) [23:19:16] * MaxSem bites Zuul [23:19:24] (PS2) Dzahn: Linkify Phabricator Task references in gerrit [puppet] - https://gerrit.wikimedia.org/r/164880 (owner: QChris) [23:19:53] MaxSem: yeah, this is taking really really long. [23:20:22] (CR) Dzahn: "heh, yea, just copied from existing patterns. but it seems i'm a duplicate of https://gerrit.wikimedia.org/r/#/c/164880/1 more or less, s" [puppet] - https://gerrit.wikimedia.org/r/165923 (owner: Dzahn) [23:20:49] can we just run it on a dozen supercomputers?
:| [23:21:47] i do wonder what the bottleneck is (processor, cpu, io bandwidth, etc) [23:21:57] because some of those are rather cheap(relative to engineer time waiting) to fix [23:22:17] (CR) Dzahn: [C: 2] Linkify Phabricator Task references in gerrit [puppet] - https://gerrit.wikimedia.org/r/164880 (owner: QChris) [23:23:30] ebernhardson: it does everything at least twice, processes queues synchronously and serially even when the jobs have no interdependency, and has a shit-ton of peacock jobs that are non-voting and which are of no value to anyone [23:24:21] wheeee [23:24:26] gerrit is a roast [23:24:33] yea, that doesn't sound like something throwing hardware $$ at will fix :( [23:25:01] if you aim the hardware well and throw it hard enough, it might [23:25:15] :) [23:25:56] (PS1) Dzahn: add reports.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/165927 [23:26:17] (PS2) Dzahn: add reports.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/165927 [23:27:10] (CR) Dzahn: "works! :)" [puppet] - https://gerrit.wikimedia.org/r/164880 (owner: QChris) [23:27:42] (CR) Dzahn: [C: -2] "duplicate of https://gerrit.wikimedia.org/r/#/c/164880/ - use "T" now to link to phab" [puppet] - https://gerrit.wikimedia.org/r/165923 (owner: Dzahn) [23:27:51] (Abandoned) Dzahn: gerrit - add 'phab' short link to phabricator [puppet] - https://gerrit.wikimedia.org/r/165923 (owner: Dzahn) [23:29:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [23:29:45] ^ bblack? [23:31:07] MaxSem: are we still mid-swat? [23:31:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [23:33:56] (CR) Dzahn: "this is to discuss if the name is good for this purpose.
it will need follow-up to add varnish config on misc-web if we take it. i would s" [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn) [23:34:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [23:37:31] andrewbogott, yep - too many submodule updates and zuul silliness [23:37:51] ok [23:38:24] Are you going to scap it all at the end? [23:38:34] is it required? [23:38:48] No, just want to get an idea of how much time it will take [23:39:09] hoo, does WD need a recursive update? [23:39:26] No, it's only one repo [23:39:33] Everything else is embedded there [23:40:05] That's what's needed to use composer for prod. [23:40:09] !log maxsem Synchronized php-1.25wmf2/extensions/Wikidata/: (no message) (duration: 00m 10s) [23:40:15] hoo, ^^ [23:40:16] Logged the message, Master [23:40:19] please test [23:40:25] Already done :) [23:40:29] Fatals are easy to test [23:40:37] wmf3 as well, please [23:41:24] apergos: ping [23:43:07] !log maxsem Synchronized php-1.25wmf3/extensions/Wikidata/: (no message) (duration: 00m 10s) [23:43:11] hoo, ^^ [23:43:12] Logged the message, Master [23:43:14] Thanks :) [23:43:39] Reaching apergos at this time is not possible, I guess? [23:43:46] yup [23:43:49] :S [23:44:07] mutante: Want to do me a favour and do something on snapshot for me? [23:44:37] * hoo would be so happy if we could finally get this access request through *sigh* [23:45:51] hoo: what do you need?
[23:45:54] !log maxsem Synchronized php-1.25wmf2/extensions/Flow/: (no message) (duration: 00m 09s) [23:46:00] Logged the message, Master [23:46:06] ori: Two empty files deleted and one cron started per hand [23:46:14] hoo: go on [23:46:29] Nice, give me a moment [23:46:53] ebernharson, ^^^ [23:46:59] (Abandoned) Dzahn: remove 10.0.0.0/16 Tampa subnet from DHCP [puppet] - https://gerrit.wikimedia.org/r/164241 (owner: Dzahn) [23:47:25] hoo@snapshot1003:~$ sudo -u datasets rm /mnt/data/xmldatadumps/public/other/wikidata/2014100* [23:47:28] ori: ^ do that [23:48:07] those are both broken due to a fatal [23:48:19] yes, they're 4k [23:48:21] s/broken/empty/ [23:48:22] done [23:48:25] :) [23:49:09] sudo -u datasets /usr/local/bin/dumpwikidatajson.sh [23:49:14] that's needed to create a new one [23:49:22] you probably want to run that in a screen or so [23:49:26] takes ~10h [23:49:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [23:50:22] i'll just stare at it intently for 10 hours [23:50:50] Awesome :D [23:50:50] hoo: running [23:50:58] !log maxsem Synchronized php-1.25wmf2/extensions/MobileApp: (no message) (duration: 00m 04s) [23:51:02] Yay :) [23:51:03] Thanks [23:51:04] Logged the message, Master [23:52:02] !log maxsem Synchronized php-1.25wmf3/extensions/MobileApp: (no message) (duration: 00m 03s) [23:52:06] Logged the message, Master [23:53:39] MaxSem: time for OSM now? :) [23:54:18] no, we're not deploying OpenStreetMaps today [23:55:23] lol [23:55:41] !log maxsem Synchronized php-1.25wmf2/extensions/OpenStackManager/: (no message) (duration: 00m 04s) [23:55:48] Logged the message, Master [23:56:33] andrewbogott: ^ [23:56:35] (PS3) Ottomata: Grant analytics shell account access to Marcel Ruiz Forns [puppet] - https://gerrit.wikimedia.org/r/165909 [23:56:43] legoktm: isn't there another one?
[23:56:54] no, it's just one submodule update [23:56:58] ok [23:57:06] and no wmf3? [23:57:17] (too many commits for me today) [23:57:23] oh, we probably should backport to wmf3 [23:57:28] wikitech is still on wmf2 [23:57:35] (CR) Dzahn: [C: 2] "these are down and all UNKNOWN in icinga meanwhile. cleaning that up https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=" [puppet] - https://gerrit.wikimedia.org/r/165673 (owner: Dzahn) [23:57:43] !log maxsem Synchronized php-1.25wmf3/resources/: (no message) (duration: 00m 04s) [23:57:48] Logged the message, Master [23:57:51] !log maxsem Synchronized php-1.25wmf2/resources/: (no message) (duration: 00m 03s) [23:57:57] Logged the message, Master [23:58:00] prtksxna ^^ [23:58:05] but we don't want the API modules to disappear on thursday.. [23:58:26] pfff [23:58:31] I think I'm done [23:58:33] (PS4) Ottomata: Grant analytics shell account access to Marcel Ruiz Forns [puppet] - https://gerrit.wikimedia.org/r/165909 [23:58:42] MaxSem: Checking… [23:58:43] (CR) Ottomata: [C: 2 V: 2] Grant analytics shell account access to Marcel Ruiz Forns [puppet] - https://gerrit.wikimedia.org/r/165909 (owner: Ottomata) [23:58:52] MaxSem: yeah, we can backport them another day. [23:58:59] legoktm: It's a bi-weekly API... you just need to plan your actions ;) [23:59:18] lol [23:59:38] Thanks MaxSem! [23:59:54] andrewbogott: let me know once you sync wikitech and then I'll test it there