[01:03:00] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:03:00] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [01:03:00] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:03:00] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:03:10] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [01:03:10] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:04:00] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9064999 keys, up 3 minutes 54 seconds - replication_delay is 0 [01:04:00] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8972597 keys, up 3 minutes 53 seconds - replication_delay is 0 [01:04:00] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9066975 keys, up 3 minutes 53 seconds - replication_delay is 0 [01:04:00] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9069696 keys, up 3 minutes 54 seconds - replication_delay is 0 [01:04:01] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4361997 keys, up 3 minutes 53 seconds - replication_delay is 0 [01:04:01] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4362444 keys, up 3 minutes 57 seconds - replication_delay is 0 [01:17:41] marostegui, you around to do https://phabricator.wikimedia.org/T169396 ? 
[01:17:57] hi, btw :) [02:40:14] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 14m 45s) [02:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.67 seconds [03:36:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 253.74 seconds [04:57:21] (03PS1) 10BryanDavis: Labs: Update cdnjs clone commands [puppet] - 10https://gerrit.wikimedia.org/r/362928 [05:15:00] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=171.90 Read Requests/Sec=130.70 Write Requests/Sec=55.00 KBytes Read/Sec=825.20 KBytes_Written/Sec=15063.60 [05:16:00] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=150.90 Read Requests/Sec=148.35 Write Requests/Sec=68.43 KBytes Read/Sec=1086.51 KBytes_Written/Sec=836.76 [05:18:20] PROBLEM - HHVM rendering on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:10] RECOVERY - HHVM rendering on mw2133 is OK: HTTP OK: HTTP/1.1 200 OK - 78796 bytes in 0.301 second response time [05:31:46] !log Run redact sanitarium on db1095 - T160869 [05:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:58] T160869: Prepare and check storage layer for kbp.wikipedia.org - https://phabricator.wikimedia.org/T160869 [05:44:04] !log Run redact sanitarium on db1069 - T160869 [05:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:14] T160869: Prepare and check storage layer for kbp.wikipedia.org - https://phabricator.wikimedia.org/T160869 [06:02:29] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3399565 (10ArielGlenn) I'm fine with not deploying on the weekend (as evidenced by this la... [07:15:18] 10Operations, 10ops-esams, 10Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3399619 (10Volans) I've commented out the `MAILADDR` line to avoid to get one email per day. Given that we have also the Icinga check we could consider to comment it out broadly across the fleet. The fi... [07:16:09] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3399624 (10Volans) I've commented out the `MAILADDR` line to avoid to get one email per day. Given that we have also the Icinga check we could consider to comment it out broadly across the fleet. The file is currently no... 
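An aside on the two mdadm comments just above (T166965, T169035): the change Volans describes amounts to muting the array's daily mail while the broken disk stays tracked in Phabricator and covered by the Icinga RAID check. A minimal sketch of that kind of edit is below; it assumes the stock Debian paths, is not the exact command run on lvs3001/bast3002, and the `exit 0` follow-up he mentions appears later in the log at 07:33.

```
# Sketch only: quiet the daily mdadm mail on a host whose degraded array is
# already tracked in Phabricator and covered by the Icinga RAID check.

# 1. Comment out MAILADDR so the mdadm monitor stops mailing root once a day.
sed -i 's/^MAILADDR/#MAILADDR/' /etc/mdadm/mdadm.conf

# 2. Without MAILADDR the daily report script refuses to run and generates cron
#    spam instead, so short-circuit it near the top with an early exit.
sed -i '2i exit 0  # temporarily disabled, see T166965 / T169035' /etc/cron.daily/mdadm
```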
[07:24:43] !log bounced uwsgi-graphite-web on graphite1003, log stopped since Jul 2 10:23:45 [07:24:50] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.011 second response time [07:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:02] godog: FYI ^^^ [07:30:45] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362944 [07:30:48] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362944 [07:33:16] 10Operations, 10ops-esams, 10Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3399661 (10Volans) And of course that was not enough, I had to also add an `exit 0` to `/etc/cron.daily/mdadm` to prevent it from running, without the `MAILADDR` setting the report check refuses to run... [07:33:24] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3399662 (10Volans) And of course that was not enough, I had to also add an `exit 0` to `/etc/cron.daily/mdadm` to prevent it from running, without the `MAILADDR` setting the report check refuses to run and generates cron... [07:35:37] !log Drop alter table s7 - labsdb1003 - T166208 [07:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:48] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [07:36:39] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362944 (owner: 10Marostegui) [07:37:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362944 (owner: 10Marostegui) [07:37:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362944 (owner: 10Marostegui) [07:40:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 - T166208 (duration: 03m 00s) [07:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:03] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [07:42:05] (03CR) 10Thiemo Mättig (WMDE): "Looks fine technically. But the commit message should explain where this file comes from, how it was created or converted, and how other d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362606 (https://phabricator.wikimedia.org/T168582) (owner: 10Smalyshev) [07:44:00] (03CR) 10Alexandros Kosiaris: [C: 04-2] "servermon doesn't even yet support django 1.8 (https://github.com/servermon/servermon/blob/master/requirements.txt), let alone django 1.10" [software/servermon] - 10https://gerrit.wikimedia.org/r/362600 (owner: 10Paladox) [07:46:18] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "I can confirm this language was added to core more than two months ago via I2bf03c9 (T163600)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [07:47:23] 10Operations, 10Puppet: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3399696 (10Joe) [07:51:36] !log Deploy alter table db1039 - s7 - T166208 [07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:46] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [07:52:51] nothing better than a refreshing alter table on a Monday morning [07:52:59] lol [07:53:48] 10Operations, 10Puppet, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3399730 (10Joe) [07:54:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [07:55:05] (03PS1) 10Marostegui: db-eqiad.php: Add comments to db1039 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362947 (https://phabricator.wikimedia.org/T166208) [07:58:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comments to db1039 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362947 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [07:59:26] (03Merged) 10jenkins-bot: db-eqiad.php: Add comments to db1039 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362947 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [07:59:35] (03CR) 10jenkins-bot: db-eqiad.php: Add comments to db1039 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362947 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [08:01:08] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3399757 (10elukey) [08:01:10] 10Operations, 10Patch-For-Review, 10User-Elukey: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#3399754 (10elukey) 05Open>03Resolved a:03elukey The bug seems resolved, but for posterity it seems to me that https://gerrit.wikimedia.org/r/362382 would have been... [08:02:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments about db1039 status - T166208 (duration: 02m 49s) [08:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:46] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [08:03:09] volans: thanks! [08:03:29] godog: yw :) [08:06:00] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [08:07:16] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3399759 (10MoritzMuehlenhoff) >>! In T165520#3386850, @faidon wrote: > We talked about this a little bit on IRC. I think we agreed to try stretch with node.js 6, since we're going to have to do that at some poin... 
[08:08:13] (03CR) 10Elukey: "Comments left unanswered :)" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[08:08:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] Set up grafana dashboard monitoring for services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) (owner: 10GWicke)
[08:12:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[08:12:02] (03CR) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[08:12:14] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3399763 (10akosiaris) 05Open>03stalled >>! In T165520#3399759, @MoritzMuehlenhoff wrote: >>>! In T165520#3386850, @faidon wrote: >> We talked about this a little bit on IRC. I think we agreed to try stretch...
[08:13:32] gehel: o/ - Those MW errors seems related to stuff like Pool error on CirrusSearch-NamespaceLookup:_elasticsearch: pool-queuefull
[08:13:45] not really urgent but can you double check when you'll have a min?
[08:13:58] elukey: sure, having a look (cc dcausse)
[08:14:07] thanks!
[08:15:08] elastic@eqiad is under heavy load
[08:15:10] wow, we do have a significant slowdown on elasticsearch
[08:15:29] and not a single server this time...
[08:17:03] it looks like it is already recovering
[08:17:11] * gehel is tempted to blame a bot :)
[08:25:52] !log banning elastic1020 from elasticsearch eqiad waiting for its recovery
[08:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:28] (03PS1) 10Filippo Giunchedi: hieradata: enable swift storage policies in codfw [puppet] - 10https://gerrit.wikimedia.org/r/362949 (https://phabricator.wikimedia.org/T151648)
[08:29:53] (03PS23) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[08:29:59] !log Compress dewiki on dbstore2001 - T168354
[08:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:09] T168354: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354
[08:32:18] 10Operations, 10Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3399799 (10MoritzMuehlenhoff) Yeah, that's correct, the underlying memory leak isn't fixed, only hidden by the restarts. This is likely still unfixed in stretch, there's nothing in the 2.4.4...
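For context on the "banning elastic1020" entry above at 08:25 (and the matching "unbanning" later in the log at 08:49): banning a node normally means excluding it from shard allocation so the cluster drains it while it recovers. The exact tooling gehel used is not shown here; a sketch of the underlying Elasticsearch cluster-settings call such a ban typically maps to:

```
# Sketch, not the command actually run: exclude elastic1020 from shard
# allocation ("ban")...
curl -s -XPUT http://localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": "elastic1020*"}}'

# ...and clear the exclusion again ("unban").
curl -s -XPUT http://localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'
```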
[08:35:39] (03PS2) 10Filippo Giunchedi: hieradata: enable swift storage policies in codfw [puppet] - 10https://gerrit.wikimedia.org/r/362949 (https://phabricator.wikimedia.org/T151648) [08:35:41] (03PS1) 10Filippo Giunchedi: swift: fix duplicate dispersion cron name [puppet] - 10https://gerrit.wikimedia.org/r/362950 (https://phabricator.wikimedia.org/T151648) [08:38:27] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6914/" [puppet] - 10https://gerrit.wikimedia.org/r/362949 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [08:38:50] (03CR) 10Filippo Giunchedi: [C: 032] swift: fix duplicate dispersion cron name [puppet] - 10https://gerrit.wikimedia.org/r/362950 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [08:39:01] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable swift storage policies in codfw [puppet] - 10https://gerrit.wikimedia.org/r/362949 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [08:39:31] 10Operations, 10Puppet, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3399834 (10Joe) First hurdle: puppetlabs advises to set the variable `environment_timeout` to `unlimited` and to restart the puppetmaster at ev... [08:40:00] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:42:23] (03PS7) 10Elukey: Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [08:42:43] 10Operations, 10Puppet, 10User-Joe: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493#3399848 (10Joe) [08:44:15] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3398609 (10jcrespo) @Marostegui That doesn't work- older hosts have 300GB disks- older but not so much have 600GB ones. [08:44:53] (03PS8) 10Elukey: role::analytics_cluster::refinery::job::data_drop: drop old druid data [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [08:46:43] 10Operations, 10cloud-services-team, 10Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3399875 (10MoritzMuehlenhoff) Which NFS services/processes caused this? [08:49:24] !log unbanning elastic1020 from elasticsearch eqiad [08:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:27] 10Operations, 10Puppet, 10User-Joe: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493#3399904 (10Joe) The passenger docs about this are pretty clear: in the floss version of passenger, a restart is blocking - that is passenger will wait for all currently s... [08:51:47] 10Operations, 10Traffic, 10netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3399920 (10Marostegui) In this row, as important hosts we have to downtime: es2018 -> es1014 needs to be downtimed as it will page with replication broken db2029 is s7 codfw master (We might want to dow... [08:52:37] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3399923 (10Marostegui) Good catch @jcrespo - thank you. @Cmjohnson please advise if you ran out of 600GB spare disks. 
Thanks guys [08:55:25] (03PS24) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [08:58:55] <_joe_> !log restarting the passenger app on puppetmaster1002 for T169493 [08:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:07] T169493: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493 [09:03:48] (03CR) 10Volans: [C: 031] "Great job! And thanks for baring with all my comments, including the tedious and OCD-driven ones ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [09:07:30] <_joe_> !log restarting the passenger app on puppetmasters in codfw serially with a sleep of 3 seconds for T169493 [09:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] T169493: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493 [09:09:30] 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3400000 (10Marostegui) [09:09:33] 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3400015 (10Marostegui) p:05Triage>03Normal [09:12:41] 10Operations, 10Traffic, 10netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3400023 (10elukey) For kafka2002 it is sufficient to depool it from eventbus via pybal/conftool, and then re-balance the cluster when the work is done (will take care of the two steps). [09:13:11] 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3400000 (10jcrespo) We may want to hold this, at least unless a switch is planned- hosts 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3400043 (10Marostegui) 05Open>03stalled [09:20:11] (03CR) 10Muehlenhoff: wikimania_scholarships: add support for stretch and PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362137 (owner: 10Dzahn) [09:24:06] !log Deploy alter table on s1 directly on codfw master (db2016) and let it replicate - T166204 [09:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:17] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [09:26:33] 10Operations, 10Traffic, 10netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3400075 (10elukey) During the last upgrade we have generated a lot of pages and alerts on IRC, that is expected since the maintenance needed reboots of all the row's switches. What I am wondering is if w... [09:30:43] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3400081 (10alanajjar) @Marostegui start it now? 
[09:33:16] @Marostegui [09:33:17] https://phabricator.wikimedia.org/T169396 [09:34:28] Alaa: Give me a sec to open all the stuff to monitor it [09:34:42] okay, all the time with you :) [09:37:14] !log Global rename of Markos90 → Mαρκος - T169396 [09:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:24] T169396: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396 [09:38:07] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3400117 (10alanajjar) We started https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/M%CE%B1%CF%81%CE%BA%CE%BF%CF%82 [09:44:00] PROBLEM - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.35.26.194 [09:44:20] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [09:44:51] RECOVERY - Juniper alarms on mr1-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [09:45:20] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 [09:48:45] 10Operations, 10hardware-requests: Reclaim/Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3400130 (10MoritzMuehlenhoff) [09:51:44] 10Operations, 10Puppet, 10User-Joe: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493#3400151 (10Joe) So all my attempts at restarting one or multiple instances of the puppetmaster backends luckily didn't cause any puppet errors, but it takes up to 4 minut... [09:51:55] 10Operations, 10Puppet, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3400153 (10Joe) [09:51:57] 10Operations, 10Puppet, 10User-Joe: Test restarting puppetmaster workers for every code deploy - https://phabricator.wikimedia.org/T169493#3400152 (10Joe) 05Open>03declined [09:55:43] marostegui: you around? :) [09:56:06] yep [09:56:35] marostegui: have time to supervise a global rename? [09:56:44] TabbyCat: We are currently doing another one - T169396 [09:56:45] T169396: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396 [09:56:53] TabbyCat: So better to wait until that is finished :) [09:57:06] ah, I wanted to do that one heh [09:57:17] TabbyCat: Alaa took care of it :) [09:57:19] if it is in progress already then I said nothing [09:57:29] great :) [09:57:34] hehe cool - thanks! 
[09:57:55] :) [10:03:26] (03PS1) 10Filippo Giunchedi: hieradata: enable swift storage policies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/362958 (https://phabricator.wikimedia.org/T151648) [10:03:28] (03PS1) 10Filippo Giunchedi: swift: delete swift-object-reconstructor unit [puppet] - 10https://gerrit.wikimedia.org/r/362959 (https://phabricator.wikimedia.org/T151648) [10:08:34] (03PS1) 10Giuseppe Lavagetto: profile::docker::builder: add dependency on virtualenv [puppet] - 10https://gerrit.wikimedia.org/r/362960 [10:08:55] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::docker::builder: add dependency on virtualenv [puppet] - 10https://gerrit.wikimedia.org/r/362960 (owner: 10Giuseppe Lavagetto) [10:11:08] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3400169 (10Marostegui) a:03alanajjar [10:12:22] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable swift storage policies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/362958 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [10:12:26] (03PS2) 10Filippo Giunchedi: hieradata: enable swift storage policies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/362958 (https://phabricator.wikimedia.org/T151648) [10:13:07] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: enable swift storage policies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/362958 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [10:16:53] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3400195 (10alanajjar) Finished [10:17:11] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3400211 (10alanajjar) 05Open>03Resolved [10:18:08] Thanks Marostegui :) [10:18:29] thank you! [10:19:20] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:26:03] (03PS2) 10Filippo Giunchedi: swift: delete swift-object-reconstructor unit [puppet] - 10https://gerrit.wikimedia.org/r/362959 (https://phabricator.wikimedia.org/T151648) [10:30:11] (03CR) 10Filippo Giunchedi: [C: 032] swift: delete swift-object-reconstructor unit [puppet] - 10https://gerrit.wikimedia.org/r/362959 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [10:39:57] !log rebooting mw1298 for kernel update [10:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:33] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Implement storage policies for swift - https://phabricator.wikimedia.org/T151648#3400307 (10fgiunchedi) [10:44:40] 10Operations, 10Traffic, 10netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3400320 (10ayounsi) During the first upgrade, despite the switch being downtimed in Icinga, the upgrade process didn't make it go down (as in it was still replying to pings) so all the hosts depending on... 
[10:46:22] 10Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3400335 (10jcrespo) a:05jcrespo>03None [10:47:20] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:56:32] 10Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3400378 (10jcrespo) [10:59:44] 10Operations, 10ops-esams: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3400383 (10fgiunchedi) [11:00:09] (03PS1) 10Filippo Giunchedi: Decom swift cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/362965 (https://phabricator.wikimedia.org/T169518) [11:04:54] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/6915/" [puppet] - 10https://gerrit.wikimedia.org/r/362965 (https://phabricator.wikimedia.org/T169518) (owner: 10Filippo Giunchedi) [11:06:50] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [11:12:57] (03PS1) 10Filippo Giunchedi: Decom ms-fe.svc.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/362968 (https://phabricator.wikimedia.org/T169518) [11:14:31] (03CR) 10Filippo Giunchedi: [C: 032] Decom ms-fe.svc.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/362968 (https://phabricator.wikimedia.org/T169518) (owner: 10Filippo Giunchedi) [11:17:50] 10Operations, 10ops-esams, 10DC-Ops: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#3400441 (10fgiunchedi) 05Open>03declined See {T169518} [11:18:01] 10Operations, 10ops-esams, 10DC-Ops: setup the 2 new esams ms-be systems - https://phabricator.wikimedia.org/T86784#3400445 (10fgiunchedi) 05Open>03declined See {T169518} [11:18:38] 10Operations, 10ops-esams, 10DC-Ops: ms-be3003 sdk (bay 11) broken - https://phabricator.wikimedia.org/T83811#3400449 (10fgiunchedi) 05Open>03Invalid See {T169518} [11:26:21] 10Operations, 10media-storage, 10Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#3400483 (10fgiunchedi) [12:01:41] (03PS1) 10Elukey: Set stat1005's pxe boot option to stretch [puppet] - 10https://gerrit.wikimedia.org/r/362975 (https://phabricator.wikimedia.org/T165368) [12:02:12] (03CR) 10Elukey: [V: 032 C: 032] Set stat1005's pxe boot option to stretch [puppet] - 10https://gerrit.wikimedia.org/r/362975 (https://phabricator.wikimedia.org/T165368) (owner: 10Elukey) [12:03:36] !log rebooting scb1004 for kernel update [12:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:51] jouncebot: refresh [12:13:54] I refreshed my knowledge about deployments. 
[12:13:55] jouncebot: next [12:13:55] In 48 hour(s) and 46 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300) [12:16:20] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [12:16:50] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3400682 (10Marostegui) Taking db1102 for: T169510 [12:17:15] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3400684 (10Marostegui) [12:18:04] (03PS2) 10Hashar: Reopen nlwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361686 (https://phabricator.wikimedia.org/T168764) (owner: 10Urbanecm) [12:25:53] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3400700 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['stat1005.... [12:26:18] !log reimage stat1005 with Debian Stretch [12:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:02] (03PS1) 10Marostegui: db-codfw.php: Add status for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362978 (https://phabricator.wikimedia.org/T169510) [12:31:45] (03PS1) 10Faidon Liambotis: Revert "base: cleanup unneeded ipmi packages/checks" [puppet] - 10https://gerrit.wikimedia.org/r/362980 [12:32:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add status for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362978 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [12:32:46] (03Draft1) 10Paladox: Gerrit: Add tag url to gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362979 [12:32:48] (03PS2) 10Paladox: Gerrit: Add tag url to gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362979 [12:33:09] (03CR) 10Paladox: "The only thing I'm not sure of is will this break gerrit 2.13?" 
[puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [12:33:15] (03Merged) 10jenkins-bot: db-codfw.php: Add status for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362978 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [12:33:27] (03CR) 10jenkins-bot: db-codfw.php: Add status for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362978 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [12:36:11] (03CR) 10Alexandros Kosiaris: [C: 04-2] "See https://github.com/lamby/pkg-gunicorn/blob/debian/sid/debian/NEWS" [puppet] - 10https://gerrit.wikimedia.org/r/362601 (owner: 10Paladox) [12:36:38] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add comments about db2056 status - T169510 (duration: 02m 50s) [12:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [12:39:59] !log Compress innodb on db2056 - T169510 [12:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:50] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=703.80 Read Requests/Sec=367.00 Write Requests/Sec=50.20 KBytes Read/Sec=46877.60 KBytes_Written/Sec=420.40 [12:46:50] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=148.20 Read Requests/Sec=110.50 Write Requests/Sec=24.10 KBytes Read/Sec=1486.80 KBytes_Written/Sec=466.40 [12:48:46] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3400760 (10jcrespo) Marostegui, quick question, do you know what is the state of this- are those other servers still not going up, do you want me to have a third quick look in case it is a... [12:50:35] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3400761 (10Marostegui) >>! In T162233#3400760, @jcrespo wrote: > Marostegui, quick question, do you know what is the state of this- are those other servers still not going up, do you want m... [12:55:17] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3400764 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['stat1005.eqiad.wmnet'] ``` Of which those **FAILED**: ```... 
[13:02:12] jouncebot: next [13:02:12] In 47 hour(s) and 57 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300) [13:02:30] jouncebot is bugged [13:02:35] or is it me [13:02:51] ah no that is "no deploy on July 3rd" [13:09:14] (03PS2) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/362421 [13:10:41] (03PS1) 10Giuseppe Lavagetto: role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 [13:12:20] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:12:56] (03PS3) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/362421 [13:13:14] (03CR) 10Paladox: Gerrit: Add tag url to gitweb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [13:13:16] (03PS1) 10WMDE-leszek: Set Wikibase readFullEntityIdColumn to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 [13:14:05] (03CR) 10Paladox: Gerrit: Add tag url to gitweb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [13:14:19] (03PS2) 10WMDE-leszek: Set Wikibase readFullEntityIdColumn setting to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 [13:15:27] (03Draft1) 10Paladox: Gerrit: Remove linkDrafts from gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362987 [13:15:30] (03PS2) 10Paladox: Gerrit: Remove linkDrafts from gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362987 [13:18:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=574.30 Read Requests/Sec=357.30 Write Requests/Sec=14.70 KBytes Read/Sec=41944.80 KBytes_Written/Sec=387.20 [13:19:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=228.10 Read Requests/Sec=218.10 Write Requests/Sec=21.00 KBytes Read/Sec=7413.60 KBytes_Written/Sec=920.40 [13:25:42] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3400800 (10Papaul) @mobrovac Dell Tech will be on site today at 9:45 am CDT for motherboard replacement. Can you please put the system in maintenance mode? Thanks. 
[13:25:53] 10Operations, 10Operations-Software-Development, 10Goal, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3400801 (10Volans) [13:28:08] 10Operations, 10Operations-Software-Development, 10Goal, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3400802 (10Volans) [13:39:39] !log uploaded apache2 2.4.10+deb8u9+wmf1 to apt.wikimedia.org/jessie-wikimedia [13:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:07] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400808 (10alanajjar) [13:40:40] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400808 (10Marostegui) @alanajjar if you want to do it now, I am available [13:42:17] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400822 (10alanajjar) [13:42:48] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400808 (10alanajjar) @Marostegui. of course, let us start! [13:43:06] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400825 (10Marostegui) Go ahead! [13:43:51] !log Global rename of Antero de Quintal → JMagalhães - T169527 [13:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:59] T169527: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527 [13:45:19] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400845 (10alanajjar) We start. [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/JMagalh%C3%A3es |The log]] [13:45:50] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400846 (10alanajjar) a:03Marostegui [13:45:56] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400847 (10Marostegui) >>! In T169527#3400845, @alanajjar wrote: > We start. > > [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/JMag... [13:48:47] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/362421 (owner: 10Alexandros Kosiaris) [13:52:00] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:20] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:30] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2002029 [13:52:31] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[13:52:56] akosiaris: $facts["ipmi_lan"] is :undef, not a hash or array
[13:53:20] these are VMS
[13:53:24] VMs
[13:53:30] why on earth ...
[13:53:30] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:40] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:00] PROBLEM - puppet last run on kubetcd2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:00] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:25] !log rebooting ms-be2028 to ms-be2035 for kernel update
[13:54:30] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:30] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:40] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:47] hmmm
[13:54:50] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:04] is there an apache update? Apt-get upgrade shows apache2 is being upgraded. But i thought that it breaks for us?
[13:55:40] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:41] paladox: see SAL, moritz uploaded one few minutes ago
[13:55:49] ah, ok. thanks :)
[13:56:02] akosiaris: the has_ipmi facts is not there on VMs
[13:56:08] is not false, just not there at all
[13:56:30] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:56:31] volans: no.. it's $facts.. not facts
[13:56:33] sigh... typo
[13:56:34] fixing
[13:56:35] akosiaris: tx :)
[13:56:48] oh damn, right, looking at the code now :D
[13:57:00] PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:40] PROBLEM - puppet last run on kubetcd2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:41] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:58:10] * volans ready to re-run puppet on failed ones as soon as the fix is merged
[13:58:20] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:58:28] (03PS1) 10Marostegui: db2056.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/362991 (https://phabricator.wikimedia.org/T148507)
[13:58:30] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:58:37] (03PS1) 10Alexandros Kosiaris: Fix typo with forgotten $ in monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/362992
[13:58:37] volans: should be easy.. it will be all VMs
[13:58:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=761.70 Read Requests/Sec=374.80 Write Requests/Sec=1.20 KBytes Read/Sec=46391.60 KBytes_Written/Sec=25.20
[13:58:46] (03CR) 10Marostegui: [C: 04-2] "Do not merge yet" [puppet] - 10https://gerrit.wikimedia.org/r/362991 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui)
[13:58:52] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo with forgotten $ in monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/362992 (owner: 10Alexandros Kosiaris)
[13:58:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix typo with forgotten $ in monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/362992 (owner: 10Alexandros Kosiaris)
[13:59:18] I like how PCC missed this
[13:59:20] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:59:22] I like how I missed this
[13:59:28] how jenkins missed this
[13:59:30] really jolly
[13:59:40] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:59:47] akosiaris: yeah in this case is easier but I like to use the --failed-only ;)
[13:59:53] sudo cumin -b 10 -p 95 'F:is_virtual = true' 'run-puppet-agent --failed-only'
[14:00:00] it could be run on * fwiw
[14:00:12] akosiaris: merged?
[14:00:18] just finished
[14:00:22] shoot
[14:00:30] shooting :D
[14:00:30] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:40] PROBLEM - puppet last run on kubetcd2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:40] PROBLEM - puppet last run on roentgenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:41] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:48] (03PS1) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:00:48] ok kraz is happy this time around
[14:01:00] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[14:01:09] (03PS2) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:01:20] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[14:01:20] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:01:30] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[14:01:30] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[14:01:30] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[14:01:30] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[14:01:30] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[14:01:40] RECOVERY - puppet last run on kubetcd2002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:01:40] RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[14:01:40] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[14:01:40] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[14:01:40] RECOVERY - puppet last run on roentgenium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[14:01:41] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[14:01:41] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[14:01:42] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[14:01:50] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[14:02:00] RECOVERY - puppet last run on kubetcd2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[14:02:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[14:02:00] RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:02:20] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[14:02:31] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[14:02:32] akosiaris: all good
[14:02:37] :-)
[14:02:40] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[14:02:50] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[14:03:04] (03CR) 10jerkins-bot: [V: 04-1] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush)
[14:04:58] (03PS3) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:05:50] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.50 Read Requests/Sec=13.50 Write Requests/Sec=32.10 KBytes Read/Sec=199.60 KBytes_Written/Sec=6976.40
[14:06:05] (03CR) 10jerkins-bot: [V: 04-1] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush)
[14:06:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] servermon: Add gunicorn.service systemd script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox)
[14:06:28] _joe_: I got a strange output from run-puppet-agent --failed-only
[14:06:33] (03PS4) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:06:49] it failed on a host where puppet was running manually by alex, I got:
[14:06:53] Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)
[14:07:02] kraz ?
[14:07:04] yep
[14:07:15] expected then
[14:07:21] in theory it should have waited your run and then run mine
[14:07:33] it does a wait_for_puppet
[14:07:36] hmmm
[14:07:42] <_joe_> volans: you might have caught one very hard to find race?
[14:07:50] (03CR) 10jerkins-bot: [V: 04-1] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush)
[14:07:51] it might indeed :D
[14:08:03] <_joe_> like akosiaris started his run and your check just failed a fraction of a second before
[14:08:24] :D
[14:08:28] <_joe_> so in fact the end result is what you expect, just not getting the right output
[14:08:31] * volans looking at kraz logs to see if there is some evidence
[14:08:39] <_joe_> please do
[14:08:48] * akosiaris runs puppet on einsteinium to see what happened
[14:09:39] (03PS1) 10Marostegui: site.pp: Add db1102 sanitarium role [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743)
[14:09:55] _joe_: seems indeed a hard to find race, same second start
[14:09:56] Jul 3 14:00:33 kraz puppet-agent[7765]: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)
[14:09:59] Jul 3 14:00:33 kraz puppet-agent[7735]: Retrieving pluginfacts
[14:10:01] (03PS8) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455
[14:10:25] (03CR) 10Paladox: servermon: Add gunicorn.service systemd script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox)
[14:11:14] (03PS5) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:11:46] sigh... icinga doesn't love me
[14:11:59] (03CR) 10Marostegui: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/6920/" [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui)
[14:12:23] (03CR) 10jerkins-bot: [V: 04-1] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush)
[14:12:24] _joe_: so it's the time of the last_run_success check, after the wait_for_puppet, on kraz it takes ~90ms
[14:12:44] enough to get a puppet run started in the middle
[14:13:00] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[14:13:05] <_joe_> so my point stands :)
[14:13:11] <_joe_> akosiaris: ^^ icinga!
[14:13:17] yeah, the final result is correct
[14:13:26] yeah yeah, known
[14:13:39] (03CR) 10Ayounsi: Add ferm service for rpc.mountd on labstore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356347 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff)
[14:13:47] (03PS6) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:14:11] 10Operations, 10User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3400958 (10fgiunchedi)
[14:14:13] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3400956 (10fgiunchedi) 05Open>03Resolved This is completed as far as the cleanup is concerned, I've started https://wikitech.wikimedia.org/wik...
[14:14:24] akosiaris: thanks for the perfect timing that unveiled the race condition btw ;)
[14:14:43] (03CR) 10jerkins-bot: [V: 04-1] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush)
[14:15:13] (03CR) 10Ayounsi: Add ferm service for rpc.statd on labstore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff)
[14:15:47] (03PS7) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993
[14:16:08] :-)
[14:17:09] 10Operations, 10Prometheus-metrics-monitoring, 10User-Elukey: Create prometheus nutcracker exporter - https://phabricator.wikimedia.org/T155129#3400974 (10elukey) a:05elukey>03None
[14:17:18] 10Operations, 10Prometheus-metrics-monitoring, 10User-Elukey: Create prometheus nutcracker exporter - https://phabricator.wikimedia.org/T155129#2934402 (10elukey) p:05Triage>03Low
[14:17:30] (03CR) 10Ayounsi: "rpcbind also listens on 954/udp. I can't find informations on what's it for, but mentioning it here just in case." [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff)
[14:18:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[14:18:54] (03PS1) 10Alexandros Kosiaris: Remove extraneous m char from checkcommands.cfg [puppet] - 10https://gerrit.wikimedia.org/r/362998
[14:19:12] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400983 (10alanajjar) Finished. Thanks @Marostegui
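An aside on the race volans and _joe_ pin down above (14:06-14:14): the wrapper waits for any running agent, then spends roughly 90 ms checking whether the last run failed, and a manually started agent can grab the lock inside that gap, which is why the cumin run on kraz reported "already in progress; skipping" even though the end state was fine. A rough sketch of the sequence follows; this is not the real run-puppet-agent script, and check_last_run_success is a hypothetical stand-in for its --failed-only test:

```
# Illustrative only: the shape of the race observed on kraz.
while [ -e /var/lib/puppet/state/agent_catalog_run.lock ]; do
    sleep 1                  # wait_for_puppet: block while another agent runs
done
check_last_run_success       # hypothetical helper, ~90 ms on kraz
# <-- a manual "puppet agent --test" (akosiaris' run) can take the lock here -->
puppet agent --test          # then logs "Run of Puppet configuration client
                             # already in progress; skipping"
```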
Thanks @Marostegui [14:19:36] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400984 (10alanajjar) 05Open>03Resolved [14:19:45] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400986 (10Marostegui) Thank you!! :-) [14:21:10] (03CR) 10Alexandros Kosiaris: [C: 032] Remove extraneous m char from checkcommands.cfg [puppet] - 10https://gerrit.wikimedia.org/r/362998 (owner: 10Alexandros Kosiaris) [14:21:27] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Antero de Quintal → JMagalhães: supervision needed - https://phabricator.wikimedia.org/T169527#3400992 (10alanajjar) >>! In T169527#3400986, @Marostegui wrote: > Thank you!! :-) :) [14:25:58] (03PS8) 10Rush: labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 [14:30:55] PROBLEM - Host apertium.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host aqs.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-a-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-a-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-c-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-c-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-d-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:00] PROBLEM - Host asw-esams.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:01] PROBLEM - Host asw-ulsfo.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:01] PROBLEM - Host asw2-a5-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:02] PROBLEM - Host blog.wikimedia.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:02] PROBLEM - Host asw2-d-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:03] PROBLEM - Host checker.tools.wmflabs.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:05] PROBLEM - Host citoid.svc.codfw.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:10] PROBLEM - Host citoid.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:10] PROBLEM - Host commons.wikimedia.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:10] PROBLEM - Host [uconf1003.mgmt.eqiad.wmnet] is DOWN: PING CRITICAL - Packet loss = 100% [14:31:11] PROBLEM - Host [ucp1045.mgmt.eqiad.wmnet] is DOWN: PING CRITICAL - Packet loss = 100% [14:31:11] PROBLEM - Host cr1-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:11] PROBLEM - Host cr1-eqord IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:18] what's going on> [14:31:22] :| [14:31:25] rash of alerts [14:31:25] <_joe_> wtf? 
[14:31:31] PROBLEM - Host cr2-esams IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host cr2-esams.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host cr2-knams IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host cr2-knams.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host cr2-ulsfo IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host cr2-ulsfo.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] PROBLEM - Host csw2-esams.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:33] on mgmt network I guess [14:31:36] PROBLEM - Host cxserver.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:36] PROBLEM - Host [udb1063.mgmt.eqiad.wmnet] is DOWN: PING CRITICAL - Packet loss = 100% [14:31:37] XioNoX ping [14:31:37] they are mgmt networks [14:31:41] PROBLEM - Host phab.wmfusercontent.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-a4-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-a7-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-b3-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-c5-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-d1-codfw.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:41] PROBLEM - Host ps1-d4-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:46] PROBLEM - Host eventstreams.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:51] <_joe_> not just those [14:31:51] PROBLEM - Host graphoid.svc.codfw.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:51] PROBLEM - Host google.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:56] PROBLEM - Host graphoid.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:01] PROBLEM - Host thumbor.svc.codfw.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:01] PROBLEM - Host tools.wmflabs.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:12] this is weird tools.wmflabs.org.mgmt.eqiad.wmnet [14:32:12] akosiaris: might be related to yours? [14:32:25] what on earth are these ? [14:32:26] PROBLEM - Host [ukafka1018.mgmt.eqiad.wmnet] is DOWN: PING CRITICAL - Packet loss = 100% [14:32:26] PROBLEM - Host [ukafka1020.mgmt.eqiad.wmnet] is DOWN: PING CRITICAL - Packet loss = 100% [14:32:26] marostegui: akosiaris sent a warning about monitoring [14:32:27] <_joe_> yes [14:32:31] yeah... 
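The flood of *.mgmt.eqiad.wmnet alerts here traces back to the change reverted a few minutes later ("monitoring::host: Monitor IPMI as well if applicable"): it derived a management-interface host for every monitoring::host resource by appending the mgmt suffix to the resource title, and because that define is also used for LVS service IPs, DNS monitoring targets and network-gear aliases, Icinga grew impossible hosts such as apertium.svc.eqiad.wmnet.mgmt.eqiad.wmnet. Below is a minimal Python sketch of the guard that was missing; the is_physical flag and the sample data are hypothetical, and the real code is a Puppet define, not Python.

    # Sketch only: why the alert flood contains names like
    # "apertium.svc.eqiad.wmnet.mgmt.eqiad.wmnet". The real change lived in the
    # Puppet monitoring::host define (reverted below); the is_physical flag and
    # the sample data here are hypothetical.

    def mgmt_name(title, site="eqiad"):
        """Naive derivation: append the management suffix to the monitored title."""
        return f"{title}.mgmt.{site}.wmnet"

    def mgmt_hosts(monitored):
        """Derive mgmt hosts, but only for entries that are physical servers."""
        derived = []
        for entry in monitored:
            # The guard the original change lacked: monitoring::host is also used
            # for LVS service IPs, DNS names and network-gear aliases, none of
            # which have a management interface of their own.
            if not entry.get("is_physical", False):
                continue
            derived.append(mgmt_name(entry["title"]))
        return derived

    if __name__ == "__main__":
        sample = [
            {"title": "mw1228", "is_physical": True},                      # a real server
            {"title": "apertium.svc.eqiad.wmnet", "is_physical": False},   # LVS service IP
        ]
        # Without the is_physical check both entries would be emitted, producing
        # the impossible second name seen in the alerts above.
        print(mgmt_hosts(sample))   # ['mw1228.mgmt.eqiad.wmnet']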
[14:32:32] I think this is your change, akosiaris [14:32:34] <_joe_> ahahahahah [14:32:37] but he said not paging [14:32:40] PROBLEM - Host misc-web-lb.codfw.wikimedia.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:41] soemthing isn't right there w/ what it's looking at [14:32:45] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org_ipv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:50] PROBLEM - Host misc-web-lb.codfw.wikimedia.org_ipv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:50] hosts with [] brakets [14:32:58] PROBLEM - Host misc-web-lb.esams.wikimedia.org_ipv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:58] (03PS1) 10Aklapper: phabricator: Block IP ranges for recent uploaded offtopic files [puppet] - 10https://gerrit.wikimedia.org/r/363001 [14:32:59] PROBLEM - Host mobileapps.svc.codfw.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:04] PROBLEM - Host misc-web-lb.esams.wikimedia.org.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mobileapps.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mr1-codfw.oob.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mr1-eqiad IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mr1-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mr1-esams IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host mr1-eqiad.oob IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:10] PROBLEM - Host mr1-eqiad.oob.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:10] PROBLEM - Host mr1-esams.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:11] PROBLEM - Host mr1-ulsfo IPv6.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:11] PROBLEM - Host mr1-ulsfo.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:12] PROBLEM - Host mr1-ulsfo.oob.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:15] PROBLEM - Host ms-fe.svc.eqiad.wmnet.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:33:22] someone is reverting? [14:33:26] !log set enable_notification=0 in icinga [14:33:34] akosiaris: heh, I was about to [14:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:49] chasemp: revert wouldn't do it, these are collected resources [14:33:54] why on earth did these got populated ? [14:33:56] * chasemp nods [14:33:58] ah right [14:34:12] are those "virtual" hosts? [14:34:23] and what's with the alias too ? [14:34:25] like dns monitoring points only? [14:34:27] + alias [u'wtp1037.mgmt.eqiad.wmnet'] [14:34:30] akosiaris: also thins likes host_name 208.80.153.12.mgmt.codfw.wmnet [14:34:30] (03CR) 10Rush: [C: 032] labcontrol: add base::firewall to new servers [puppet] - 10https://gerrit.wikimedia.org/r/362993 (owner: 10Rush) [14:34:58] ah the virtual LVSes ? [14:35:00] sigh [14:35:04] and a few other things [14:35:11] I think we use monitoring::host for a bunch of stuff [14:35:17] <_joe_> yeah just revert :) [14:35:19] and that automatically propagated to all of them [14:35:24] yes it did [14:35:32] I 'll revert... 
this needs more thinking [14:35:50] LVS service IPs, netops monitoring, plus some other weird stuff [14:36:09] yeah also like "mr1-eqiad.oob IPv6.mgmt.eqiad.wmnet" [14:37:18] damn monitoring::host [14:37:30] that's a module that needs rspec tests after all [14:37:35] (03CR) 10Matthias Mullie: Add 3d2png deploy repo to image scalers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [14:37:38] (03PS1) 10Alexandros Kosiaris: Revert "monitoring::host: Monitor IPMI as well if applicable" [puppet] - 10https://gerrit.wikimedia.org/r/363002 [14:38:15] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "monitoring::host: Monitor IPMI as well if applicable" [puppet] - 10https://gerrit.wikimedia.org/r/363002 (owner: 10Alexandros Kosiaris) [14:38:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "monitoring::host: Monitor IPMI as well if applicable" [puppet] - 10https://gerrit.wikimedia.org/r/363002 (owner: 10Alexandros Kosiaris) [14:40:41] !log running EventLogging alter tables on dbstore1002 (script in /home/elukey/dbstore1002.sql) - T167162 [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:52] T167162: Make non-nullable columns in EL database nullable - https://phabricator.wikimedia.org/T167162 [14:45:37] (03CR) 10Alexandros Kosiaris: servermon: Add gunicorn.service systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [14:46:21] (03CR) 10Jcrespo: [C: 04-1] "Let's wait for icinga to recover." [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [14:46:32] (03PS9) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [14:46:34] (03CR) 10Paladox: servermon: Add gunicorn.service systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [14:46:55] (03CR) 10Jcrespo: [C: 04-1] "Also should we reimage as stretch?" [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [14:51:13] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3401106 (10mobrovac) {{done}} @Papaul, you are good to go! [14:52:41] (03CR) 10Marostegui: "> Also should we reimage as stretch?" [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [14:53:14] (03CR) 10Jcrespo: [C: 04-1] "Yeah" [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [14:53:21] (03CR) 10Alexandros Kosiaris: "I definitely did not say that. I said" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [14:55:04] (03CR) 10Paladox: "> I definitely did not say that. I said" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [14:55:39] (03CR) 10Paladox: "> I definitely did not say that. I said" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [14:55:47] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3401109 (10chasemp) @Papaul, can you get this knocked out this week? 
(some of our goals this quarter will depend on this it seems) [14:55:50] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3401111 (10chasemp) @Papaul, can you get this knocked out this week? (some of our goals this quarter will depend on this it seems) [14:55:53] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3401110 (10chasemp) @Papaul, can you get this knocked out this week? (some of our goals this quarter will depend on this it seems) [14:57:53] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3401118 (10chasemp) https://gerrit.wikimedia.org/r/#/c/362993/ [14:58:01] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3401131 (10Papaul) @chasemp working on it. [14:58:03] (03Abandoned) 10Rush: WIP: maintain-dbusers discussion strawman for doc system [puppet] - 10https://gerrit.wikimedia.org/r/355105 (owner: 10Rush) [14:58:17] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3401132 (10Papaul) @chasemp working on it. [14:58:31] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3401133 (10Papaul) @chasemp working on it. [14:59:13] (03CR) 10Rush: [C: 032] openstack/diamond: remove the libvirtkvm collector [puppet] - 10https://gerrit.wikimedia.org/r/362446 (owner: 10Faidon Liambotis) [14:59:20] (03PS2) 10Rush: openstack/diamond: remove the libvirtkvm collector [puppet] - 10https://gerrit.wikimedia.org/r/362446 (owner: 10Faidon Liambotis) [15:03:56] (03CR) 10Alexandros Kosiaris: "Here's the service.unit documentation. Type is the very first directive covered." [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:06:10] papaul: let me know once the maintenance of scb2005 is done [15:06:16] hello, could someone run a command in eval.php for me on plwiki? `echo json_encode( $wgRateLimits );`i'm investigating https://phabricator.wikimedia.org/T169268 [15:08:16] (03CR) 10Paladox: "> Here's the service.unit documentation. Type is the very first" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:08:34] (03PS10) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [15:12:27] 10Operations, 10Recommendation-API, 10Service-deployment-requests, 10Services (doing), 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3401197 (10akosiaris) For the record, as far as I know, this is blocked on the service being deployed to beta first (... [15:12:58] !log labvirt1001 service nova-compute restart [15:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] (03PS2) 10Faidon Liambotis: phabricator: Block IP ranges for recent uploaded offtopic files [puppet] - 10https://gerrit.wikimedia.org/r/363001 (owner: 10Aklapper) [15:15:09] (03CR) 10Faidon Liambotis: [C: 032] "Not thrilled by the whack-a-mole but better than doing nothing..." 
[puppet] - 10https://gerrit.wikimedia.org/r/363001 (owner: 10Aklapper) [15:15:29] (03CR) 10Faidon Liambotis: [V: 032 C: 032] phabricator: Block IP ranges for recent uploaded offtopic files [puppet] - 10https://gerrit.wikimedia.org/r/363001 (owner: 10Aklapper) [15:16:10] !log labcontrol1001 clean out admin-monitoring leaks [15:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:48] apergos: can you help me? "could someone run a command in eval.php for me on plwiki? `echo json_encode( $wgRateLimits );`i'm investigating https://phabricator.wikimedia.org/T169268" [15:17:29] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb4] [15:18:18] sigh, I guess that ^ is T163673 [15:18:19] T163673: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 [15:18:29] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [15:19:51] 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3401229 (10Papaul) @Marostegui Proposal approved. [15:20:32] 10Operations, 10ops-codfw, 10DBA: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3401232 (10Marostegui) Thanks @Papaul - let's leave this stalled for now. We will ping you if we decide to go for it :-) [15:23:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:24:34] ema: --^ - a lot of codfw ints ? [15:24:39] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:25:15] elukey: looking [15:25:39] thanks! Currently in a meeting but let me know if you need help [15:26:39] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:30:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:30:51] !log cp1099: restart varnish-be [15:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:13] ah cp1099 on strike? 
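The HTTP 5xx checks firing above read a request-rate series from Graphite and alert when a given share of recent datapoints sits over a fixed value, which is what the "22.22% of data above the critical threshold [1000.0]" wording means. Here is a rough Python sketch of that logic against the Graphite render API; the URL, metric name and critical percentage are placeholders, not the production check's configuration.

    # Illustration only of the "NN.NN% of data above the critical threshold"
    # semantics seen in these alerts; the production check is a separate script
    # with many more options. URL and metric name are hypothetical.
    import json
    import urllib.request

    GRAPHITE = "https://graphite.example.org/render"    # hypothetical URL
    METRIC = "reqstats.5xx"                              # hypothetical metric name
    CRIT_VALUE = 1000.0                                  # per-datapoint threshold
    CRIT_PERCENT = 20.0                                  # how much of the window may exceed it

    def percent_above(metric, threshold, minutes=10):
        """Fetch the last few minutes of a series and return the share of
        datapoints that sit above the given threshold."""
        url = f"{GRAPHITE}?target={metric}&from=-{minutes}min&format=json"
        with urllib.request.urlopen(url) as resp:
            series = json.load(resp)
        # Graphite's render API returns [{"target": ..., "datapoints": [[value, ts], ...]}]
        points = [v for s in series for v, _ in s["datapoints"] if v is not None]
        if not points:
            return 0.0
        return 100.0 * sum(1 for v in points if v > threshold) / len(points)

    if __name__ == "__main__":
        pct = percent_above(METRIC, CRIT_VALUE)
        state = "CRITICAL" if pct >= CRIT_PERCENT else "OK"
        print(f"{state}: {pct:.2f}% of data above the critical threshold [{CRIT_VALUE}]")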
[15:31:39] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:32:50] (03PS1) 10Urbanecm: Fix rate limit configuration for plwiki - ratelimit thanks-notification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363011 (https://phabricator.wikimedia.org/T169268) [15:32:59] elukey: yeah, but unrelated to the 5xx spike [15:33:13] ah okok [15:33:23] elukey: that was text codfw https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text [15:33:49] !log mobrovac@tin Started deploy [mobileapps/deploy@58a5b19]: Remove pronunciation from the spec - T169299 [15:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:58] T169299: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299 [15:34:11] (03CR) 10Jcrespo: site.pp: Add db1102 sanitarium role [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [15:34:39] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:35:29] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [15:35:29] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [15:35:51] mobrovac: \o/ [15:36:01] :) [15:37:09] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [15:37:29] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:37:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:37:29] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [15:37:59] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [15:38:09] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:38:29] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [15:38:49] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [15:38:49] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [15:38:49] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [15:43:19] !log mobrovac@tin Finished deploy [mobileapps/deploy@58a5b19]: Remove pronunciation from the spec - T169299 (duration: 09m 30s) [15:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] T169299: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299 [15:45:43] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (done): Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3401300 (10mobrovac) 05Open>03Resolved a:0... 
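The "mobileapps endpoints health" recoveries above follow the deploy that fixed the service's swagger spec (T169299): the health probes are driven by that spec, which is why a broken spec surfaces as unhealthy endpoints. Below is a minimal sketch of a spec-driven probe, under the assumption that the spec is published at /?spec and checking only parameter-less GET paths; the real checker also understands request templates and expected responses, so treat this as an outline only.

    # Minimal sketch of a spec-driven "endpoints health" probe, assuming the
    # service publishes its swagger/OpenAPI spec at /?spec. Only parameter-less
    # GET paths are probed for a 2xx; the production checker does much more.
    import json
    import urllib.request

    def check_endpoints(base_url):
        """Fetch the service's spec and probe every parameter-less GET path."""
        with urllib.request.urlopen(f"{base_url}/?spec") as resp:
            spec = json.load(resp)
        failures = []
        for path, methods in spec.get("paths", {}).items():
            if "get" not in methods or "{" in path:   # skip templated paths
                continue
            try:
                with urllib.request.urlopen(base_url + path) as r:
                    if not 200 <= r.status < 300:
                        failures.append((path, r.status))
            except Exception as exc:                  # urlopen raises on most non-2xx codes
                failures.append((path, str(exc)))
        return failures

    if __name__ == "__main__":
        bad = check_endpoints("http://localhost:8888")   # hypothetical local instance
        print("All endpoints are healthy" if not bad else f"Unhealthy endpoints: {bad}")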
[15:52:18] 10Operations, 10Puppet: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3401323 (10Joe) [15:53:31] 10Operations, 10Puppet, 10puppet-compiler, 10User-Joe: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3401341 (10Joe) p:05Triage>03Normal a:03Joe [15:53:51] 10Operations, 10Puppet, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3401346 (10Joe) [15:53:53] 10Operations, 10Puppet, 10puppet-compiler, 10User-Joe: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3401323 (10Joe) [15:54:33] 10Operations, 10Puppet, 10puppet-compiler, 10User-Joe: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3401323 (10Joe) [15:57:26] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3401370 (10Joe) [15:59:13] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3401385 (10Joe) [16:02:08] !log labvirt1002:~# service nova-compute restart [16:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:58] (03PS1) 10Dzahn: admins: contract renewal for shrlak until Jan 2018 [puppet] - 10https://gerrit.wikimedia.org/r/363017 [16:03:06] 10Operations, 10Puppet: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066#3401400 (10Paladox) Me and @hashar was talking about this a few months ago. I've created this https://gerrit.wikimedia.org/r/#/c/325064/ to get jjb to generate the job and also created... [16:10:27] (03CR) 10Muehlenhoff: [C: 032] admins: contract renewal for shrlak until Jan 2018 [puppet] - 10https://gerrit.wikimedia.org/r/363017 (owner: 10Dzahn) [16:10:46] (03PS2) 10Muehlenhoff: admins: contract renewal for shrlak until Jan 2018 [puppet] - 10https://gerrit.wikimedia.org/r/363017 (owner: 10Dzahn) [16:18:55] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3401485 (10Papaul) @mobrovac The main board replacement complete. Let me know when ready to plug the network cable in NIC 1 [16:20:10] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3401489 (10mobrovac) @Papaul feel free to proceed [16:23:25] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3401523 (10akosiaris) [16:24:54] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3401529 (10Papaul) a:05Papaul>03mobrovac @mobrovac done please test while I am on site. Assigning you back the task if everything looks good you can resolve the task. Thanks. 
[16:25:06] 10Operations, 10Puppet, 10puppet-compiler, 10User-Joe: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3401533 (10akosiaris) [16:25:08] 10Operations, 10Puppet, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3401534 (10akosiaris) [16:25:11] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3401535 (10akosiaris) [16:25:14] 10Operations, 10Puppet: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066#3401536 (10akosiaris) [16:25:15] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3401370 (10akosiaris) [16:26:09] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:27:42] papaul: the host seems to still be down [16:28:00] fixed stat1005, reset-failed puppet.service [16:28:09] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [16:28:17] (03PS1) 10Papaul: DNS: Add production DNS entries for labtestservices2003,labtestcontrol2003,labtestservices2002 and labtestmetal2001 [dns] - 10https://gerrit.wikimedia.org/r/363022 [16:28:33] elukey: can you please add a subtask to T148814 for it? [16:28:33] T148814: wmf-auto-reimage improvements - https://phabricator.wikimedia.org/T148814 [16:30:34] mobrovac: have you configure NIC1? [16:31:16] marostegui: can you configure the nic for scb2005 - re T167763 [16:31:17] T167763: Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763 [16:31:58] volans: sure [16:32:56] mobrovac: if the motherboard was replaced, probably you need to remove the udev 70-persistent-net.rules file and reboot [16:33:19] marostegui: heh i don't have the perms [16:33:22] Ah [16:33:24] I will do it :) [16:34:27] thnx! [16:39:24] papaul: there is no link on eth1 (and also the server attempted to pxe boot, so i stopped before anything) [16:43:47] marostegui: checking [16:44:30] papaul: I can see one interface with link, but the OS says eth0, let me give it a reboot [16:45:38] ok [16:46:26] papaul: looks like it is back :) [16:46:38] marostegui: OK [16:46:38] mobrovac: you should be able to login now [16:46:53] papaul: looks like the net persistent rules were playing a role _again_ [16:47:26] we are up :) [16:47:28] marostegui: thanks [16:47:40] it was weird, eth0 got the eth1 mac on the file, so i just deleted eth0 line and let it populate again after the reboot [16:47:46] mobrovac: good! [16:51:05] !log mobrovac@tin Started deploy [mobileapps/deploy@58a5b19]: (no justification provided) [16:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:47] !log mobrovac@tin Finished deploy [mobileapps/deploy@58a5b19]: (no justification provided) (duration: 00m 41s) [16:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:29] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [16:54:27] 10Operations, 10ops-codfw, 10Services (watching): Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3401716 (10mobrovac) 05Open>03Resolved Ok, `scb2005` is up and working, services have been repooled. Thank you @Papaul and @Marostegui ! 
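The scb2005 fix above hinges on /etc/udev/rules.d/70-persistent-net.rules, which pins interface names such as eth0 and eth1 to MAC addresses; after a mainboard swap the pinned MACs belong to hardware that is gone, so names get shuffled ("eth0 got the eth1 mac") or a link appears dead until the stale lines are removed and the host rebooted. A small sketch follows, assuming the stock Debian paths, that lists pinned entries whose MAC no longer exists on the box; on a real host the flagged lines would be reviewed before deleting anything.

    # Sketch of the check behind the scb2005 fix above: udev's persistent-net
    # rules pin NIC names to MAC addresses, and after a mainboard swap the
    # pinned MACs belong to hardware that no longer exists. Paths are the
    # stock Debian ones.
    import glob
    import os
    import re

    RULES = "/etc/udev/rules.d/70-persistent-net.rules"
    RULE_RE = re.compile(r'ATTR\{address\}=="(?P<mac>[0-9A-Fa-f:]+)".*NAME="(?P<name>\w+)"')

    def current_macs():
        """MAC -> interface name for the NICs the kernel actually sees."""
        macs = {}
        for path in glob.glob("/sys/class/net/*/address"):
            iface = os.path.basename(os.path.dirname(path))
            with open(path) as f:
                macs[f.read().strip().lower()] = iface
        return macs

    def stale_rules():
        """Pinned name/MAC pairs whose MAC no longer exists on this host."""
        if not os.path.exists(RULES):
            return []
        live = current_macs()
        stale = []
        with open(RULES) as f:
            for line in f:
                m = RULE_RE.search(line)
                if m and m.group("mac").lower() not in live:
                    stale.append((m.group("name"), m.group("mac")))
        return stale

    if __name__ == "__main__":
        for name, mac in stale_rules():
            print(f"{name} is pinned to {mac}, which is not present on this host")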
[16:55:16] !log Running maintain-views --all-databases --clean --replace-all --debug on labsdb1001 [16:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:49] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:39] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 78920 bytes in 0.141 second response time [17:32:39] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [17:33:21] that's me [17:33:29] (maybe) [17:33:32] * volans checking anyway [17:33:43] (03PS2) 10Dzahn: Add awight to ores-admins [puppet] - 10https://gerrit.wikimedia.org/r/361593 (https://phabricator.wikimedia.org/T168442) (owner: 10Ladsgroup) [17:33:57] (03PS3) 10Dzahn: admins: Add awight to ores-admins [puppet] - 10https://gerrit.wikimedia.org/r/361593 (https://phabricator.wikimedia.org/T168442) (owner: 10Ladsgroup) [17:34:10] (03CR) 10Dzahn: [C: 032] "has been approved in ops meeting today" [puppet] - 10https://gerrit.wikimedia.org/r/361593 (https://phabricator.wikimedia.org/T168442) (owner: 10Ladsgroup) [17:35:04] !log labvirt1003:~# service nova-compute restart [17:35:07] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3401854 (10GWicke) To @bblack's concern about scoping on projects vs. all *.wikimedia.org domains: So far, both the portal at www.wikimedia.... [17:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:39] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:35:48] fixed [17:36:02] (03PS1) 10Papaul: DHCP: Add DHCP entries for labtestservices2003,labtestcontrol2003,labtestservices2002 and labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/363039 [17:39:17] 10Operations: MD RAID: remove mdadm daily check - https://phabricator.wikimedia.org/T169564#3401863 (10Volans) [17:39:26] 10Operations: MD RAID: remove mdadm daily check - https://phabricator.wikimedia.org/T169564#3401877 (10Volans) p:05Triage>03Normal [17:43:11] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3364601 (10Dzahn) You have been added to the ores-admins group. This was approved in today's ops meeting. This gives you access to... [17:43:46] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3401883 (10Dzahn) 05Open>03Resolved [17:47:19] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3401905 (10chasemp) p:05Triage>03High >>! In T169286#3393637, @Andrew wrote: > I tagged dc-ops because... have y'all ever seen something like this? @Christopher ^ We had t... [17:47:55] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3401908 (10awight) Thanks! Confirmed working. 
[17:53:53] 10Operations, 10vm-requests: Site: (2) VM request for DMARC - https://phabricator.wikimedia.org/T169566#3401968 (10herron) [18:04:17] 10Operations, 10Labs: nfs-manage failover script needs to be tested with real load and fixed - https://phabricator.wikimedia.org/T169570#3402054 (10chasemp) [18:04:57] 10Operations, 10Labs: nfs-manage failover script needs to be tested with real load and fixed - https://phabricator.wikimedia.org/T169570#3402071 (10chasemp) `fuser -k` introduction or some such is possibly an addition? With nfs-kernel-server stopped file integrity issues from clients shouldn't be an issue but... [18:05:02] 10Operations, 10Labs: nfs-manage failover script needs to be tested with real load and fixed - https://phabricator.wikimedia.org/T169570#3402072 (10chasemp) p:05Triage>03High [18:17:40] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3402102 (10Volans) Opened T169564 for the mdadm configuration. [18:21:54] (03PS6) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [18:21:56] (03PS5) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [18:21:58] (03PS3) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [18:22:00] (03PS8) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [18:22:54] damn gerrit [18:23:02] (03CR) 10jerkins-bot: [V: 04-1] Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [18:28:56] (03PS7) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [18:29:48] (03CR) 10Volans: "restored PS 5" [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [18:33:21] (03PS2) 10RobH: DHCP: Add DHCP entries for labtestservices2003,labtestcontrol2003,labtestservices2002 and labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/363039 (owner: 10Papaul) [18:34:26] 10Operations, 10vm-requests: Site: (2) VM request for DMARC - https://phabricator.wikimedia.org/T169566#3402134 (10herron) [18:34:28] (03CR) 10RobH: [C: 032] DHCP: Add DHCP entries for labtestservices2003,labtestcontrol2003,labtestservices2002 and labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/363039 (owner: 10Papaul) [18:35:20] (03Draft1) 10Paladox: ORES: Add scap::target to create user deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/363042 [18:35:22] (03PS2) 10Paladox: ORES: Add scap::target to create user deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/363042 [18:39:07] marostegui: Hi [18:39:13] marostegui: still working? 
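The cumin patches being iterated on here ("ClusterShell: allow to set a timeout per command") build on the ClusterShell Python library, where a timeout can be attached to an individual command rather than to the whole run. The snippet below is a standalone illustration of that library feature, not the cumin change itself; the node names are placeholders and the script assumes SSH access to them.

    # Not the cumin patch: a minimal standalone look at the ClusterShell
    # feature it builds on, a timeout scoped to one command rather than to
    # the whole run. Node names are placeholders; SSH access is assumed.
    from ClusterShell.Task import task_self

    task = task_self()
    # Each shell() call carries its own timeout, so one slow command can be
    # bounded without also cutting short the quick one queued after it.
    task.shell("sleep 300", nodes="node[01-02]", timeout=5)    # expected to time out
    task.shell("uname -r", nodes="node[01-02]", timeout=60)    # plenty of time
    task.resume()

    for buf, nodes in task.iter_buffers():
        print(nodes, buf)
    for node in task.iter_keys_timeout():
        print(f"{node}: command timed out")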
[18:39:19] (03PS8) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [18:39:21] (03PS6) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [18:39:23] (03PS4) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [18:39:25] (03PS9) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [18:39:39] * volans will start hating gerrit tonight [18:39:49] I'm surely doing something wrong on my side, but still [18:42:22] last attempt [18:42:30] (03PS7) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [18:42:32] (03PS5) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [18:42:36] (03PS10) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [18:45:30] (03PS9) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [18:46:07] (03PS3) 10Paladox: ORES: Add scap::target to create user deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/363042 [18:47:53] (03CR) 10Volans: [C: 032] ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [18:48:12] (03PS3) 10GWicke: Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) [18:48:24] (03Merged) 10jenkins-bot: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [18:48:45] (03PS4) 10Paladox: ORES: Create user deploy-service using user and group syntax [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) [18:49:22] (03PS4) 10GWicke: Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) [18:49:54] (03PS8) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [18:50:49] (03CR) 10Volans: [C: 032] CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [18:51:20] (03Merged) 10jenkins-bot: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [18:52:30] (03PS6) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [19:06:35] (03PS1) 10Ladsgroup: Enable WikiLove for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363043 (https://phabricator.wikimedia.org/T169563) [19:09:50] !log nuria@tin 
Started deploy [eventlogging/analytics@328dea6]: (no justification provided) [19:09:53] !log nuria@tin Finished deploy [eventlogging/analytics@328dea6]: (no justification provided) (duration: 00m 03s) [19:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:48] (03PS1) 10Herron: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 [19:21:05] (03PS1) 10Mobrovac: OCG: Do not use the INFO command as a readiness check [puppet] - 10https://gerrit.wikimedia.org/r/363045 [19:21:17] (03PS2) 10Dzahn: nagios_common: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 (owner: 10Herron) [19:21:25] (03CR) 10Dzahn: [C: 031] nagios_common: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 (owner: 10Herron) [19:23:04] (03CR) 10Herron: [C: 032] nagios_common: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 (owner: 10Herron) [19:27:08] (03PS2) 10Jforrester: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) [19:29:13] (03PS3) 10Herron: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 [19:29:22] (03CR) 10Ladsgroup: "I think the default value is false but it might change later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [19:29:25] !log restarting jenkins [19:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:58] (03CR) 10Ladsgroup: [C: 031] "Just saw that, sorry. Do you want this to be deployed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [19:30:44] (03CR) 10Herron: [C: 032] Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 (owner: 10Herron) [19:31:35] (03PS4) 10Herron: Add myself to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/363044 [19:40:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [19:41:08] (03CR) 10Awight: ORES: Create user deploy-service using user and group syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [19:41:38] (03CR) 10Ladsgroup: [C: 031] "Scheduled for deployment in prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [19:44:39] (03CR) 10Chad: "Alex: any chance you could have another look at this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [19:49:02] (03CR) 10Dzahn: [C: 032] "the tickets for the 3 hosts in public network all say "any row", IPs are unused, everything matches, lgtm" [dns] - 10https://gerrit.wikimedia.org/r/363022 (owner: 10Papaul) [19:51:57] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10Dzahn) labtestcontrol2003.wikimedia.org has address 208.80.153.75 [19:52:05] (03PS2) 10Smalyshev: Add more units for conversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362606 (https://phabricator.wikimedia.org/T168582) [19:52:25] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10Dzahn) labtestservices2003.wikimedia.org has address 208.80.153.109 [19:52:45] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10Dzahn) labtestservices2002.wikimedia.org has address 208.80.153.76 [19:53:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Dzahn) labtestmetal2001.codfw.wmnet has address 10.192.20.11 [19:55:43] (03PS1) 10Catrope: Enable Echo per-user blacklist on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363049 (https://phabricator.wikimedia.org/T150419) [19:57:09] (03CR) 10Catrope: [C: 04-2] "Not yet (blocked on wmf.8 being deployed to meta; scheduled for July 12th 23:00 UTC)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363049 (https://phabricator.wikimedia.org/T150419) (owner: 10Catrope) [20:01:15] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:01:24] (03CR) 10Dzahn: [C: 04-1] "it has been said on ops meeting today that we don't even need this anymore, so i guess that means -1 but we all like that it's not even ne" [puppet] - 10https://gerrit.wikimedia.org/r/360876 (owner: 10RobH) [20:01:40] (03Abandoned) 10RobH: adding rootdelay to jessie installs [puppet] - 10https://gerrit.wikimedia.org/r/360876 (owner: 10RobH) [20:07:29] (03CR) 10Hashar: [C: 031] "I am all for it, but that needs extra caution and attention to push it." [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [20:17:15] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:19:30] ebernhardson, gehel ^ [20:20:15] Looking [20:22:19] (03CR) 10Dzahn: [C: 032] "soo.. i had some questions/comments on using -p with rsync or not, the direction of syncing and i would like to make the cron-part optiona" [puppet] - 10https://gerrit.wikimedia.org/r/361811 (owner: 10Chad) [20:22:29] (03PS3) 10Dzahn: Create rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/361811 (owner: 10Chad) [20:24:14] !log banning elastic1018 from elasticsearch eqiad clsuter [20:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:51] (03CR) 10Dzahn: [C: 032] "of course i love the idea of using one specialized but permanent class for this vs. using temp migration classes that we delete again. 
i c" [puppet] - 10https://gerrit.wikimedia.org/r/361811 (owner: 10Chad) [20:30:21] (03CR) 10Dzahn: "this would copy more than just releases, it also copies static websites like annual/endowment/static-bz that should stay on bromine.. i me" [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [20:32:24] (03CR) 10Dzahn: "i see there is ONE file (releases-header.html) which is outside /srv/org/wikimedia/releases/ but it's ALSO inside the dir, let me amend it" [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [20:39:16] 10Operations, 10Recommendation-API, 10Service-deployment-requests, 10Services (doing), 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3402481 (10mobrovac) Deployment in BetaCluster is {{done}}, cf T148129#3402480 . [20:41:11] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:46:56] !log unbanning elastic1018 from elasticsearch eqiad cluster [20:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:48] (03PS5) 10Paladox: ORES: Create user deploy-service using user and group syntax [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) [20:55:49] (03CR) 10Paladox: ORES: Create user deploy-service using user and group syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [21:13:07] PROBLEM - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:13:27] PROBLEM - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:13:47] PROBLEM - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:14:07] PROBLEM - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [21:14:37] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused [21:15:17] PROBLEM - configured eth on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:15:27] PROBLEM - dhclient process on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:15:47] PROBLEM - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group [21:16:07] PROBLEM - nutcracker port on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:16:27] PROBLEM - nutcracker process on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:16:37] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [21:16:37] PROBLEM - puppet last run on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:16:57] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:16:57] PROBLEM - salt-minion processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:17:17] PROBLEM - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:17:37] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:17:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:24:47] (03CR) 10Kaldari: [C: 031] Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 
(https://phabricator.wikimedia.org/T168321) (owner: 10MarcoAurelio) [21:28:43] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3402635 (10RobH) a:05RobH>03Joe System has been reinstalled and puppet key signed. mw1161-1167 are job runners mw1180-1188 are apaches So this could easily add to either of those, not sure which is best. @joe,... [21:31:26] i just confirmed that mw1228 is depooled (inactive) [21:32:17] PROBLEM - IPMI Temperature on mw1228 is CRITICAL: Return code of 255 is out of bounds [21:33:23] * mutante disables notifications, ACKs them all, re-enables notifications [21:37:07] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.48 seconds [21:37:07] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.31 seconds [21:37:17] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.34 seconds [21:37:27] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.41 seconds [21:37:37] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.02 seconds [21:37:47] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.05 seconds [21:38:17] mw1228 is mine! [21:38:21] the slave lab is not [21:38:30] mutante: sorry, i had it in maint mode but i guess it ended [21:38:33] reimage [21:39:59] !next [21:40:06] jouncebot: next [21:40:06] In 39 hour(s) and 19 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300) [21:42:07] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 38.59 seconds [21:42:07] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 37.64 seconds [21:42:17] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 30.77 seconds [21:42:27] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:42:37] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [21:42:47] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [21:55:05] robh: i saw the reimage, no problem at all. i think the reimage itself removed the maintenance period [22:09:18] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3402687 (10Paladox) puppet 5 was released. 
https://puppet.com/blog/puppet-5-platform-released [22:21:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:21:17] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:21:22] oh yeah likely [22:21:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:28:17] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:29:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:30:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:44:17] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 493.62 seconds [22:44:17] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 493.34 seconds [22:44:27] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 498.39 seconds [22:44:37] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 507.44 seconds [22:44:47] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 509.36 seconds [22:44:57] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 470.45 seconds [22:48:17] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 51.11 seconds [22:48:17] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 50.82 seconds [22:48:27] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 41.08 seconds [22:48:37] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [22:48:47] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.31 seconds [22:48:58] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [23:01:30] fatal: update_ref failed for ref 'refs/heads/review/chad/361812': cannot lock ref 'refs/heads/review/chad/361812': is at f0bf26d293205e524f842856386e80a8ecd8baad but expected 87ddf7ebc5b578aff87c1b3d6c39aadb93016a92 [23:01:34] Successfully rebased and updated refs/heads/review/chad/361812. [23:01:40] wonders if it was fatal or succesful :) [23:01:55] where did you see that? [23:02:00] in my shell [23:02:18] i'm sure i messed it up sometime [23:02:31] only reason i comment at all is because i like the combo of "fatal" and "succesful" [23:02:35] did you clone every ref? [23:02:48] vi .git/config will show weather it's refs/* or refs/heads/* [23:03:41] ? 
it seems every line is refs/heads/production [23:11:57] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] [23:12:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [23:12:58] (03PS5) 10Dzahn: rsync: Use rsync::quickdatacopy to copy data between servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [23:15:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [23:18:38] (03PS6) 10Dzahn: releases: Use rsync::quickdatacopy to copy data between servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [23:18:58] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [23:20:17] (03PS7) 10Dzahn: releases: Use rsync::quickdatacopy to copy data between servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [23:20:30] (03CR) 10Dzahn: [C: 032] releases: Use rsync::quickdatacopy to copy data between servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [23:21:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [23:27:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [23:45:26] Amir1: https://gerrit.wikimedia.org/r/#/c/361801/ [23:45:55] Krinkle: hey, what's up about it? [23:46:12] Amir1: Your -1 still stands? Or would you be okay reconsidering? [23:46:38] Krinkle: well, the PM says we should remove it and not make it a redirect [23:47:15] Also if one op or security says it's okay, I'm fine to remove my -1 [23:48:23] Amir1: Okay, just wanted to check with you directly. I wouldn't oppose maintaining a redirect in principle (and would actually encourage it if this is a doctype-level url that used to be canonical in the past). But it seems in the current form a redirect would not be acceptable on techincal grounds, so that will require some changes on wikiba.se first. Either way, the current one doesn't work. 
so no harm done :) [23:49:00] Notice the use of wikidata.org for this url in my initial text at https://phabricator.wikimedia.org/T169023 [23:49:18] 10X more, so it's useful to work on for you guys, but that's a PM decision :) [23:49:41] (03PS1) 10Dzahn: releases: use rsync::quickdatacopy in profile on bromine [puppet] - 10https://gerrit.wikimedia.org/r/363104 [23:50:29] oh for that [23:50:35] let me find my experience [23:50:40] (03CR) 10jerkins-bot: [V: 04-1] releases: use rsync::quickdatacopy in profile on bromine [puppet] - 10https://gerrit.wikimedia.org/r/363104 (owner: 10Dzahn) [23:52:41] Krinkle: https://phabricator.wikimedia.org/T163083#3203162 [23:53:35] (03PS2) 10Dzahn: releases: use rsync::quickdatacopy in profile on bromine [puppet] - 10https://gerrit.wikimedia.org/r/363104 (https://phabricator.wikimedia.org/T164030) [23:54:58] Honestly I think we should remove it for now and get https://phabricator.wikimedia.org/T99531 done ASAP [23:55:49] (03PS3) 10Dzahn: releases: use rsync::quickdatacopy in profile on bromine [puppet] - 10https://gerrit.wikimedia.org/r/363104 (https://phabricator.wikimedia.org/T164030) [23:59:54] (03CR) 10Dzahn: [C: 032] releases: use rsync::quickdatacopy in profile on bromine [puppet] - 10https://gerrit.wikimedia.org/r/363104 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn)
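The rsync::quickdatacopy class merged at the end of this log wraps a periodic one-directory copy between two hosts, and the earlier review comments turn on the direction of the sync, whether the cron part should be optional, and whether -p is needed on top of -a (archive mode already implies it). Below is a rough sketch of the shape of that copy, under the assumption of a plain rsync-over-SSH pull of /srv/org/wikimedia/releases/; the real class may well set up an rsync daemon module instead, and the source host name is only an example.

    # Rough sketch of the kind of one-directory copy rsync::quickdatacopy
    # schedules, not the actual puppet-generated command. Host name and the
    # transport (ssh vs. an rsync daemon module) are assumptions; the path
    # comes from the review discussion above.
    import subprocess

    SRC_HOST = "bromine.eqiad.wmnet"            # assumed current home of the releases tree
    SRC_PATH = "/srv/org/wikimedia/releases/"   # trailing slash: copy contents, not the dir itself

    def pull_releases(dest=SRC_PATH, dry_run=True):
        """One-directory pull, roughly what the class would run from cron."""
        # -a (archive) already implies -p, which is what the "-p or not"
        # review comment is about; --delete keeps the copy an exact mirror.
        cmd = ["rsync", "-a", "--delete", f"{SRC_HOST}:{SRC_PATH}", dest]
        if dry_run:
            cmd.insert(1, "--dry-run")          # preview the transfer before trusting it
        return subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        pull_releases()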