[00:15:54] (03PS4) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [00:47:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [00:48:02] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [00:50:45] (03CR) 10Krinkle: [C: 031] Use packaged uprightdiff in testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [00:59:02] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [01:00:02] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [03:23:32] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 850.94 seconds [03:35:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 201.11 seconds [04:10:32] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4497.30 Read Requests/Sec=1569.20 Write Requests/Sec=1.40 KBytes Read/Sec=37235.20 KBytes_Written/Sec=462.00 [04:20:32] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=78.20 Write Requests/Sec=1.20 KBytes Read/Sec=1494.40 KBytes_Written/Sec=126.80 [04:50:11] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3219934 (10Josve05a) Have we reverted, or d... 
[04:50:26] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3219936 (10Josve05a) 05Resolved>03Open [04:51:30] (03PS7) 10Tim Starling: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [04:52:41] (03CR) 10jerkins-bot: [V: 04-1] Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [04:55:31] (03PS8) 10Tim Starling: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [04:57:58] (03CR) 10Tim Starling: "PS7: should be pretty much fully baked and deployable now. Temporarily disabled in production. Moved hostnames to ProductionServices/LabsS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [05:03:31] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3219947 (10Marostegui) [05:05:50] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219949 (10Marostegui) Yes but that's only changing db-eqiad and db-codfw as we normally do when moving a server, as far as I know. 
[05:32:13] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 10Scap (Scap3-MediaWiki-MVP), 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3219967 (10Joe) [05:34:29] (03CR) 10Tim Starling: [C: 04-1] conftool: add mwconfig object type, define the first couple variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [05:34:56] (03PS4) 10Tim Starling: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [05:46:51] (03PS5) 10Tim Starling: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [05:50:12] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [05:51:12] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [05:51:59] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3219986 (10Marostegui) @Cmjohnson thanks. Let me coordinate this and we will arrange one day to do the swap. @mmodell is there any problem if we take db1048 down for a few minutes... [05:53:07] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3219987 (10Marostegui) @Cmjohnson it was supposed to be killed soon, but @Ottomata believes it will take a bit longer, so maybe it is worth replacing the BBU. @Ottom... [06:26:51] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3219993 (10Marostegui) Hi Chris, We will take it from here yes. 
Thanks for getting all this sorted for us!
[06:28:13] (03PS6) 10Tim Starling: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto)
[06:28:53] (03CR) 10Tim Starling: "PS6: per your IRC suggestions, except I didn't know what you meant by "make one dc list of variables a reference to the other"" [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto)
[06:41:46] marostegui: hey, can I run a half-an-hour script to clean some rows in ores_classification in enwiki?
[06:42:36] hey Amir1, how many rows at a time? are you checking for lag on the slaves?
[06:43:15] around 10M, in batches of 5K, wait for replication to catch up, wait a little bit more, next batch
[06:43:57] for ($i = 0; $i < some number; $i++) { $res = $dbr->selectFieldValues( 'ores_classification', 'oresc_id', 'oresc_rev < 761865592', __METHOD__, [ 'LIMIT' => 5000 ] ); $dbw->delete( 'ores_classification', [ 'oresc_id' => $res ], __METHOD__ ); \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->waitForReplication(); var_dump( count( $res ) ); sleep(2); }
[06:44:00] I would decrease the batch to maybe 1k, I know it will take a lot longer, but better than generating any lag there
[06:44:12] okay, sure
[06:44:35] Amir1: we can do a test with 1k and see if we see something, and maybe then increase a bit
[06:45:03] I did it yesterday, nothing happened (given that it waits for replication to finish)
[06:45:08] I did it with 2K
[06:45:17] let's go for 2k then
[06:45:22] I am monitoring replication now too
[06:45:23] Thanks!
[06:46:18] how long will that take, you think?
[06:46:34] I will do it only for half an hour
[06:46:52] It takes around 5-6 hours but I do it in several steps
[06:48:10] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto)
[06:48:34] (03CR) 10Giuseppe Lavagetto: [C: 032] "This is functionally ok, I'm re-cherry-picking it on beta" [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto)
[06:48:35] !log cleaning around 5-10M rows in ores_classification in enwiki (half-an-hour script, T159753)
[06:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:44] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[06:48:53] marostegui: I always keep an eye on https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=1&fullscreen&orgId=1&from=now-1h&to=now
[06:49:00] Amir1: ok, let's not leave it running on friday evening since the weekend is almost here, and monday is a public holiday in lots of places
[06:49:17] Sure
[06:49:31] thanks! :)
[06:49:52] I usually do it on mornings my time because the load on the database is the lowest at that time
[06:50:05] yeah, that makes total sense :)
[06:51:02] some lag popped up, let's add some more sleep time
[06:51:53] yeah, i saw some seconds there
[06:53:04] <_joe_> I thought monday was a bank holiday almost everywhere
[06:53:21] <_joe_> uh I still have ops, meh
[06:53:47] And tuesday it is public holiday in madrid too, so I have a pretty long weekend!
[07:20:22] 06Operations, 10ops-eqiad, 10Traffic: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#3220052 (10ema) Thanks @Cmjohnson!
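Editorial note on the ores_classification cleanup discussed between 06:41 and 06:51: the pattern agreed on there (delete a small batch, wait for replication to catch up, sleep a little, repeat) can be sketched roughly as below. This is a minimal, language-neutral illustration in Python, not the MediaWiki script that was actually run. The function name delete_in_batches and its three callables (select_ids, delete_ids, wait_for_replication) are hypothetical stand-ins for the database calls in the pasted PHP; the default batch size of 2000 and pause of 2 seconds mirror the numbers mentioned in the channel.

```python
import time
from typing import Callable, Sequence


def delete_in_batches(
    select_ids: Callable[[int], Sequence[int]],      # hypothetical: returns up to N ids to delete
    delete_ids: Callable[[Sequence[int]], None],     # hypothetical: deletes exactly those ids
    wait_for_replication: Callable[[], None],        # hypothetical: blocks until replicas catch up
    batch_size: int = 2000,
    pause_seconds: float = 2.0,
    max_batches: int = 1000,
) -> int:
    """Delete rows in small batches, letting replicas catch up between batches.

    Returns the number of rows deleted.
    """
    deleted = 0
    for _ in range(max_batches):
        ids = select_ids(batch_size)
        if not ids:
            # Nothing left to delete, so stop early instead of issuing empty deletes.
            break
        delete_ids(ids)
        # Wait for replication before continuing, then pause a bit more
        # so the replicas get some headroom.
        wait_for_replication()
        deleted += len(ids)
        time.sleep(pause_seconds)
    return deleted
```

If lag still shows up, as it did at 06:51, the two knobs to turn in a sketch like this are a smaller batch_size or a larger pause_seconds; max_batches bounds the total run time, which matches the "half an hour at a time" approach described above.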
[07:43:58] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3220054 (10MoritzMuehlenhoff) No, that seem... [07:44:28] 06Operations, 10Monitoring, 10netops: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#2779751 (10ayounsi) For the option "cr1 always backup, cr2 always master" [[ https://github.com/dnsmichi/manubulon-snmp/blob/master/plugins/check_snmp_vrrp.pl | This script ]] does exactly that. Success (... [07:45:10] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3220057 (10Marostegui) Hey @Cmjohnson I have tried to install 3 servers to just make sure they worked fine and we didn't miss anything. And also to make sure we at least have 3 for the swi... [07:48:02] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3220059 (10Marostegui) [07:50:28] (03PS1) 10Marostegui: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350798 (https://phabricator.wikimedia.org/T162539) [07:51:46] (03PS2) 10Marostegui: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350798 (https://phabricator.wikimedia.org/T162539) [07:53:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [07:53:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350798 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [07:54:15] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy 
[07:55:15] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350798 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [07:55:44] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350798 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [07:58:35] PROBLEM - Check health of redis instance on 6380 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 6194716 keys, up 22 hours 26 minutes [07:58:42] <_joe_> that's me ^^ [07:58:44] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1045 - T162539 T163548 (duration: 02m 38s) [07:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:58] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [07:58:58] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [07:59:55] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3220070 (10akosiaris) @halfak, you don't get to know the appropriate passphrase. No deployer ever does. keyholder gets armed by an ops person once (after every deployme... 
[08:00:35] RECOVERY - Check health of redis instance on 6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 6195174 keys, up 22 hours 28 minutes - replication_delay is 0 [08:03:55] !log cleanup done, 4M rows deleted (T159753) [08:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:03] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [08:06:33] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220073 (10akosiaris) Great! I'll start undoing some of the preparatory works, that is * repool puppetmaster1002 * switchover oresrdb.svc.eqia... [08:08:16] (03PS1) 10Alexandros Kosiaris: Revert "puppetmaster: Re-depool puppetmaster1002" [puppet] - 10https://gerrit.wikimedia.org/r/350799 (https://phabricator.wikimedia.org/T148506) [08:17:55] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [08:17:56] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3220083 (10jcrespo) There are some reports running on the slave- We should point the slave to the master to avoid activity there thought the dns alias. [08:21:27] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3220110 (10akosiaris) Heh, I was hoping T162850 would have solved it. It's a bit concerning that a `R420` (it is indeed a R420) has possibly exhibited the same symptoms. The box will be 3 years old next Monda... 
[08:22:22] (03CR) 10Alexandros Kosiaris: [C: 031] Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773 (owner: 10Faidon Liambotis) [08:25:28] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "puppetmaster: Re-depool puppetmaster1002" [puppet] - 10https://gerrit.wikimedia.org/r/350799 (https://phabricator.wikimedia.org/T148506) (owner: 10Alexandros Kosiaris) [08:32:02] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 65.82 seconds [08:38:13] (03Abandoned) 10Elukey: Remove DNS entries for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/346791 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [08:38:25] (03Abandoned) 10Elukey: Depool esams due to networking failures [dns] - 10https://gerrit.wikimedia.org/r/346504 (owner: 10Elukey) [08:47:02] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:01:42] PROBLEM - Host oresrdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:22] RECOVERY - Host oresrdb1001 is UP: PING OK - Packet loss = 0%, RTA = 37.09 ms [09:02:37] !log Upgrade mariadb on db1081 and db1084 from 10.0.23 to 10.0.28 [09:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:15] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3220182 (10akosiaris) [09:03:40] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724981 (10akosiaris) [09:03:42] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3193640 (10akosiaris) 05Resolved>03Open And T148506 is done, re-opening and 
switching back [09:04:42] PROBLEM - Check health of redis instance on 6380 on oresrdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [09:05:32] RECOVERY - Check health of redis instance on 6380 on oresrdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 21521246 keys, up 3 minutes 28 seconds [09:07:08] weird [09:14:44] that's me ^ [09:14:46] ignore [09:15:11] !log reboot oresrdb1001 for kernel upgrade [09:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:35] !log upgrade mariadb db1071 from 10.0.23 to 10.0.28 [09:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] (03CR) 10Phuedx: "Sorry for adding y'all directly. I'm looking to replicate Ieb902f45. Is this enough?" [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [09:29:45] !log upgrade mariadb db1059,db1056 from 10.0.22 to 10.0.28 [09:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:33] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo) [09:41:17] (03PS1) 10Alexandros Kosiaris: Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/350807 (https://phabricator.wikimedia.org/T163326) [09:41:22] (03PS2) 10Alexandros Kosiaris: Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/350807 (https://phabricator.wikimedia.org/T163326) [09:41:47] (03PS3) 10Alexandros Kosiaris: Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/350807 (https://phabricator.wikimedia.org/T163326) [09:42:02] (03PS1) 10Jcrespo: mariadb: Add db1097 as a new eqiad s4 slave [puppet] - 10https://gerrit.wikimedia.org/r/350808 (https://phabricator.wikimedia.org/T164057) [09:43:14] (03PS2) 10Jcrespo: mariadb: Add db1097 as a new 
eqiad s4 slave [puppet] - 10https://gerrit.wikimedia.org/r/350808 (https://phabricator.wikimedia.org/T164057) [09:45:03] (03CR) 10Marostegui: [C: 031] mariadb: Add db1097 as a new eqiad s4 slave [puppet] - 10https://gerrit.wikimedia.org/r/350808 (https://phabricator.wikimedia.org/T164057) (owner: 10Jcrespo) [09:49:49] (03CR) 10Jcrespo: [C: 032] mariadb: Add db1097 as a new eqiad s4 slave [puppet] - 10https://gerrit.wikimedia.org/r/350808 (https://phabricator.wikimedia.org/T164057) (owner: 10Jcrespo) [09:53:47] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/350807 (https://phabricator.wikimedia.org/T163326) (owner: 10Alexandros Kosiaris) [09:56:01] (03PS1) 10Giuseppe Lavagetto: scap::dsh: transition to using confd [puppet] - 10https://gerrit.wikimedia.org/r/350810 [09:56:49] !log installing libxslt security updates on trusty [09:56:53] (03CR) 10jerkins-bot: [V: 04-1] scap::dsh: transition to using confd [puppet] - 10https://gerrit.wikimedia.org/r/350810 (owner: 10Giuseppe Lavagetto) [09:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:35] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3220304 (10Volans) Regarding the implementation of the MW configuration, in particular CR ht... [09:59:42] (03CR) 10Volans: [C: 04-1] "It's great to see this going forward, but I don't agree with the failure model, see my comments inline." 
(034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [10:02:52] 06Operations, 10Wikimedia-Logstash, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3220309 (10elukey) Summary after a chat with Filippo and Andrew. There are two ways to viable approaches: 1) Implement a Kafka 50X topic and use the Logstash... [10:04:44] (03PS2) 10Giuseppe Lavagetto: scap::dsh: transition to using confd [puppet] - 10https://gerrit.wikimedia.org/r/350810 [10:05:30] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3220324 (10elukey) 05Open>03Resolved I'd love to do it anyway, but Chris is super busy and this is only a "good to have" for the moment, so... [10:05:32] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220326 (10elukey) [10:06:50] (03PS5) 10Volans: Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) [10:07:36] (03PS2) 10Giuseppe Lavagetto: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:09:45] (03CR) 10Volans: [C: 032] Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [10:10:08] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:10:50] (03PS3) 10Giuseppe Lavagetto: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:13:44] (03PS1) 
10Jcrespo: labsdb: Add redact_sanitarium script to sanitarium (#1, db1069) [puppet] - 10https://gerrit.wikimedia.org/r/350812 [10:14:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Actually, I just realized there is a problem with this patch." [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:17:14] (03PS2) 10Jcrespo: labsdb: Add redact_sanitarium script to sanitarium (#1, db1069) [puppet] - 10https://gerrit.wikimedia.org/r/350812 [10:19:35] (03PS1) 10Elukey: Remove mgmt dns records for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) [10:20:24] 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Puppet compiler: sync newest facts only - https://phabricator.wikimedia.org/T157052#3220358 (10Volans) 05Open>03Resolved [10:21:29] (03PS4) 10Giuseppe Lavagetto: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:22:36] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3220367 (10jcrespo) [10:22:52] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) [10:23:11] (03PS1) 10Muehlenhoff: The build of bootstrap-vz failed in the source package generation stage of pdebuild (i.e. 
before the build dependencies are installed in the pbuilder chroot): [puppet] - 10https://gerrit.wikimedia.org/r/350815 [10:23:16] (03CR) 10Jcrespo: [C: 032] labsdb: Add redact_sanitarium script to sanitarium (#1, db1069) [puppet] - 10https://gerrit.wikimedia.org/r/350812 (owner: 10Jcrespo) [10:24:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 minor comments, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350810 (owner: 10Giuseppe Lavagetto) [10:24:36] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2557878 (10TheDJ) Can someone be so kind to document on mediawiki.org how to configure this ? Many people there are interested in running electron on... [10:30:32] (03PS2) 10Muehlenhoff: The build of bootstrap-vz failed in the source package generation stage of pdebuild (i.e. before the build dependencies are installed in the pbuilder chroot): [puppet] - 10https://gerrit.wikimedia.org/r/350815 [10:32:41] (03CR) 10Muehlenhoff: [C: 032] The build of bootstrap-vz failed in the source package generation stage of pdebuild (i.e. before the build dependencies are installed in the [puppet] - 10https://gerrit.wikimedia.org/r/350815 (owner: 10Muehlenhoff) [10:34:11] (03PS5) 10Giuseppe Lavagetto: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:35:58] (03PS6) 10Giuseppe Lavagetto: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:36:25] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "PCC says it'a a noop now." 
[puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [10:41:48] (03PS1) 10Filippo Giunchedi: [WIP] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) [10:42:12] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3220411 (10Marostegui) So I did another round: db1096 -> installed The ones that we would need @Cmjohnson to check (no need to be done this week!) db1098 -> after attempting pxe boot: bl... [10:42:31] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3220412 (10Marostegui) [10:42:56] !log reboot oresrdb1002 for kernel upgrade [10:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [10:43:38] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220415 (10akosiaris) [10:43:41] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3220413 (10akosiaris) 05Open>03Resolved And switched back, re-resolving. Thanks! 
[10:44:12] (03PS3) 10Giuseppe Lavagetto: scap::dsh: transition to using confd [puppet] - 10https://gerrit.wikimedia.org/r/350810 [10:44:49] PROBLEM - Host oresrdb1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:09] RECOVERY - Host oresrdb1002 is UP: PING OK - Packet loss = 0%, RTA = 38.02 ms [10:45:37] (03PS2) 10Filippo Giunchedi: [WIP] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) [10:46:33] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [10:47:30] !log migrate/evacuate ganeti2005, ganeti2006 for T164011 [10:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:39] T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011 [10:47:39] PROBLEM - Check health of redis instance on 6380 on oresrdb1002 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [10:47:50] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh: transition to using confd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350810 (owner: 10Giuseppe Lavagetto) [10:48:37] (03PS3) 10Filippo Giunchedi: [WIP] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) [10:48:39] RECOVERY - Check health of redis instance on 6380 on oresrdb1002 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 21550665 keys, up 3 minutes 42 seconds [10:58:52] (03PS1) 10Volans: Mediawiki: update role name for maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/350818 (https://phabricator.wikimedia.org/T160178) [11:02:03] (03CR) 10Volans: [C: 032] Mediawiki: update role name for maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/350818 
(https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:02:11] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3220427 (10elukey) >>! In T161488#3189079, @MoritzMuehlenhoff wrote: > mw2092 is still showing up in servermon: https://servermon.wikimedia.org/hos... [11:07:22] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895#3220434 (10Marostegui) I have downtimed all the slaves in s5,s6 and s7 for 10 hours. [11:16:09] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, PCC https://puppet-compiler.wmflabs.org/6259/" [puppet] - 10https://gerrit.wikimedia.org/r/350775 (owner: 10Faidon Liambotis) [11:17:40] (03PS1) 10Giuseppe Lavagetto: scap::dsh::group: split does regex splitting [puppet] - 10https://gerrit.wikimedia.org/r/350819 [11:22:07] (03PS2) 10Giuseppe Lavagetto: scap::dsh::group: split does regex splitting [puppet] - 10https://gerrit.wikimedia.org/r/350819 [11:23:57] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh::group: split does regex splitting [puppet] - 10https://gerrit.wikimedia.org/r/350819 (owner: 10Giuseppe Lavagetto) [11:24:47] PROBLEM - DPKG on copper is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:47] RECOVERY - DPKG on copper is OK: All packages OK [11:30:17] 06Operations, 10Ops-Access-Requests: Request access to SWAP - https://phabricator.wikimedia.org/T164060#3220453 (10Aklapper) See https://phabricator.wikimedia.org/project/profile/956/ for full instructions. Adding #Ops-Access-Requests so someone will see this task. 
;)
[11:54:17] PROBLEM - DPKG on analytics1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:54:22] (03PS1) 10Giuseppe Lavagetto: scap::dsh::group: add datacenters to scap/conftool config [puppet] - 10https://gerrit.wikimedia.org/r/350823
[11:55:34] ^analytics1027 is fine
[11:55:39] (03CR) 10jerkins-bot: [V: 04-1] scap::dsh::group: add datacenters to scap/conftool config [puppet] - 10https://gerrit.wikimedia.org/r/350823 (owner: 10Giuseppe Lavagetto)
[11:57:17] RECOVERY - DPKG on analytics1027 is OK: All packages OK
[12:00:16] (03PS1) 10Marostegui: wmnet: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099)
[12:00:36] (03CR) 10Marostegui: [C: 04-2] "Do not submit until eqiad is back as the active DC" [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui)
[12:12:47] are there some problems? I'm seeing slower responses than usual
[12:15:09] not that we are aware of
[12:15:10] <_joe_> thedj: https://grafana.wikimedia.org/dashboard/db/performance-metrics?refresh=5m&orgId=1 doesn't suggest that's the case, but doing what specifically?
[12:15:13] :)
[12:16:01] <_joe_> performance might depend on the specific wiki, of course
[12:16:33] i think it's mostly in site js...
[12:17:12] resources taking longer to do initial response
[12:20:39] 06Operations, 10Ops-Access-Requests: Request access to SWAP - https://phabricator.wikimedia.org/T164060#3220569 (10phuedx) Interesting, I thought I'd added those projects. Thanks, @Aklapper!
[12:26:50] That may be because a lot of site js broke with wmf.21?
[12:27:26] A ton of deprecated methods were removed.
Users in -tech already reported that this breaks a lot of user scripts
[12:27:27] https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2017-April/001574.html
[12:28:05] thedj: ^
[12:31:17] eddiegp: possibly related
[12:36:47] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3220599 (10Liuxinyu970226)
[12:39:49] (03PS2) 10Giuseppe Lavagetto: scap::dsh::group: add datacenters to scap/conftool config [puppet] - 10https://gerrit.wikimedia.org/r/350823
[12:40:23] thedj: I'm going to see if it affects me on enwiki; if so, it may be global
[12:41:59] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh::group: add datacenters to scap/conftool config [puppet] - 10https://gerrit.wikimedia.org/r/350823 (owner: 10Giuseppe Lavagetto)
[12:42:42] Trizek: ^
[12:46:37] PROBLEM - confd service on mira is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[12:46:57] PROBLEM - confd service on bast1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[12:47:10] _joe_: ^^^
[12:47:51] * Trizek puts his glasses on.
[12:49:15] thedj: it appears that none of my userscripts are affected... so either i use all updated userscripts or it's only certain wikis for some reason
[12:50:05] left a note here: https://phabricator.wikimedia.org/T122755#3220618
[12:50:05] I have reports from fr.wp and I'm patrolling VPTs to see if there are more issues reported.
[12:50:16] But none of my userscripts are affected on fr.
[12:51:13] 06Operations, 06Labs, 10Tool-Labs: Upgrade bootstrap-vz version for tools docker builder - https://phabricator.wikimedia.org/T157526#3220620 (10MoritzMuehlenhoff)
[12:51:15] Trizek: most of what i have seen is addOnloadHook, addPortletLink directly in userscripts.
And a lot of mw.util call that do not ensure their dependency on mediawiki.util module [12:51:23] 06Operations, 10ops-eqiad, 15User-fgiunchedi: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385#3220621 (10fgiunchedi) @Cmjohnson yeah today at 10AM your time works for me, if not monday works too [12:52:45] This problem is definitely a good reason to have common scripts accross wikis. [12:52:47] (03PS1) 10Muehlenhoff: Switch to stretch backport [puppet] - 10https://gerrit.wikimedia.org/r/350829 (https://phabricator.wikimedia.org/T157526) [12:53:37] PROBLEM - confd service on tegmen is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [12:53:57] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: Upgrade bootstrap-vz version for tools docker builder - https://phabricator.wikimedia.org/T157526#3220625 (10MoritzMuehlenhoff) That's actually already fixed by the bootstrap-vz version in stretch, I've backported it for jessie-wikimedia along with bac... [12:54:58] thedj: I wonder if we should maybe send out another notice? [12:55:48] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:55:53] Zppix: that's always a hard cost/benefit to do. [12:55:59] _joe_: seems that the change is making the icinga check failing, the unit seems failing: Active: inactive (dead) [12:56:20] thedj: I would do it but as I have no access to do so we'd have to find someone with access [12:56:35] (03PS9) 10Tim Starling: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [13:00:18] PROBLEM - confd service on bast4001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [13:00:31] (03CR) 10Tim Starling: "Failing fast is fine by me. 
It makes things simpler since all we have to do is call EtcdConfig::get(), and it will throw a ConfigException" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [13:01:32] Zppix: What type of notice do you think about? Wondering as you're refering to access. I mean, reaching out to mailing lists and village pumps is how we usually announce technical changes. Besides that we already did that, there's no need for some specific access for that, so you seem to have something different on your mind? [13:01:56] (03PS1) 10Giuseppe Lavagetto: scap::dsh::group: fix newline template, cassandra DCs [puppet] - 10https://gerrit.wikimedia.org/r/350831 [13:02:27] <_joe_> volans: tegmen has dsh::groups? [13:02:30] <_joe_> ouch :P [13:02:40] eddiegp: i was thinking mass msg or central notice [13:02:45] _joe_: bast[1001,2001,3002,4001].wikimedia.org,einsteinium.wikimedia.org,mira.codfw.wmnet,naos.codfw.wmnet,tegmen.wikimedia.org,tin.eqiad.wmnet [13:02:45] <_joe_> volans: I should be able to fix that in a few minutes [13:02:49] ok thanks [13:02:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [13:03:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [13:04:20] (03PS1) 10Ema: Release 4.1.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350832 [13:04:38] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /srv 61958 MB (12% inode=99%) [13:06:15] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh::group: fix newline template, cassandra DCs [puppet] - 10https://gerrit.wikimedia.org/r/350831 (owner: 10Giuseppe Lavagetto) [13:06:32] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Issues with 
displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3220647 (10Aklapper) 05Open>03Resolved... [13:08:18] (03CR) 10jerkins-bot: [V: 04-1] Release 4.1.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350832 (owner: 10Ema) [13:08:25] Zppix: I don't think it affects all users though and of those who are affected most just use those scripts and aren't into scripting themselves. So I don't think a central notice is appropriate, IMHO that's for things affecting (almost) everybody (like DC switch, stewards election etc.) [13:08:37] RECOVERY - confd service on mira is OK: OK - confd is active [13:09:23] <_joe_> ok, one more and it should be done [13:09:23] eddiegp: i was thinking more of central notice talking about what to do if your affected instead of people constantly coming to -tech just to be told the same thing [13:11:29] (03PS1) 10Giuseppe Lavagetto: scap::dsh::group: fix regression in conftool template [puppet] - 10https://gerrit.wikimedia.org/r/350833 [13:12:05] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] scap::dsh::group: fix regression in conftool template [puppet] - 10https://gerrit.wikimedia.org/r/350833 (owner: 10Giuseppe Lavagetto) [13:12:57] Zppix: idk, do so (ask somebody/create a task for it) if you mind [13:13:24] (03CR) 10Alexandros Kosiaris: [C: 031] Switch to stretch backport [puppet] - 10https://gerrit.wikimedia.org/r/350829 (https://phabricator.wikimedia.org/T157526) (owner: 10Muehlenhoff) [13:21:40] RECOVERY - confd service on tegmen is OK: OK - confd is active [13:21:52] (03PS2) 10Muehlenhoff: Switch to stretch backport [puppet] - 10https://gerrit.wikimedia.org/r/350829 (https://phabricator.wikimedia.org/T157526) [13:22:45] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3220724 (10Cmjohnson) @elukey Is this okay 
to power off? [13:23:37] cmjohnson1: the host is completely borked, feel free to go! [13:23:46] cool..thx [13:23:52] cmjohnson1: morning! :-) [13:23:59] good morning! [13:24:10] um...afternoon [13:24:13] (03CR) 10Muehlenhoff: [C: 032] Switch to stretch backport [puppet] - 10https://gerrit.wikimedia.org/r/350829 (https://phabricator.wikimedia.org/T157526) (owner: 10Muehlenhoff) [13:24:25] cmjohnson1: when you are done with that task, let me know when you want to start with T163895 [13:24:26] T163895: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895 [13:24:59] marostegui: let's do yours....the other is not pressing atm [13:25:01] ideally if you can send us the IPs in advance, we can change them in the host so it boots up with the correct one directly once it is moved, and we also change the db-eqiad and db-codfw.php files [13:25:06] cmjohnson1: cool! [13:25:32] whichever host you want to start with I am fine [13:25:36] sure...give me a a few mins and I will send those to you db1062 does not need a change since it's staying in row D [13:25:46] ah great :) [13:26:29] !log Stop MySQL and shutdown db1062 - T163895 [13:26:34] jynus: ^ [13:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] ok, do we need to depool it from mediawiki? [13:27:05] it is a master, so I guess no? :) [13:27:14] ok [13:27:26] going to shutdown mysql first, without powering the server off [13:27:29] to see if something arises [13:27:34] we can quickly start mysql again [13:27:49] (03PS2) 10Ema: Release 4.1.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350832 [13:28:46] marostegui: are we moving these 1 at a time or all 3 at once [13:28:48] ? [13:29:07] cmjohnson1: I would prefer to move one at the time if that is fine by you [13:29:09] cmjohnson1, whatever is easier for you, I assume the 3 of them? 
[13:29:17] he [13:29:20] RECOVERY - confd service on bast4001 is OK: OK - confd is active [13:29:21] haha [13:29:24] whatever manuel says [13:29:42] mysql on db1062 is down, let's see if something happens [13:29:42] one at a time is fine.....just will 1 dns change at a time. Let's start with db1061 moving to row C [13:29:43] on logtash [13:30:00] cmjohnson1: ok, db1061 it is then [13:30:03] 10.64.32.267 [13:30:11] !log Stop MySQL and shutdown db1061 - T163895 [13:30:14] marostegui, we can wait for some error to happen [13:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:19] T163895: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895 [13:30:21] but then do the 3 of them [13:30:21] yeah, that is what I want [13:30:24] let me know when you power off [13:30:30] sure [13:30:44] going to shutdown mysql and change the ip [13:32:12] (03PS1) 10Muehlenhoff: Fix compatability with current bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350835 [13:33:01] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) [13:34:00] (03PS1) 10Cmjohnson: Changing production dns for db1061, moving to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350837 [13:34:16] (03CR) 10jerkins-bot: [V: 04-1] Changing production dns for db1061, moving to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350837 (owner: 10Cmjohnson) [13:34:17] (03CR) 10Jcrespo: "position change" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:35:15] (03PS3) 10Ema: Release 4.1.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350832 [13:35:17] what, 267 in an ip? 
[13:35:55] cmjohnson1 ^ [13:35:55] that doesn't make sense [13:36:22] (03CR) 10Jcrespo: [C: 04-1] "Cannot be .267" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:36:54] 227 maybe? [13:37:27] maybe, it is free at least [13:37:30] let's wait for cmjohnson1 [13:37:52] sorry 227 [13:38:04] 267 doesn't makes sense [13:38:05] good catch for the linter [13:38:09] :-) [13:38:25] I think we have exhausted chris too much this week [13:38:33] yeah [13:38:39] sorry! [13:38:40] RECOVERY - Disk space on elastic1022 is OK: DISK OK [13:38:45] (03PS3) 10Elukey: Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [13:39:07] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) [13:39:18] 06Operations, 06Release-Engineering-Team, 05Goal, 06Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3220763 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:39:26] cmjohnson1: db1061 is now down [13:40:23] (03CR) 10Ema: [V: 032 C: 032] Release 4.1.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350832 (owner: 10Ema) [13:41:34] https://gerrit.wikimedia.org/r/#/c/350836/ -> looks good? [13:41:38] jynus? 
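The .267 slip above is the kind of mistake a small pre-merge check can catch, since every octet of an IPv4 address must fall in the range 0-255. A minimal sketch, assuming only Python's standard-library `ipaddress` module; the helper name `is_valid_ipv4` is hypothetical, and this is not the linter mentioned in the channel:

```python
import ipaddress


def is_valid_ipv4(text: str) -> bool:
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        # Raised for out-of-range octets, wrong octet counts, and non-numeric input.
        return False
    return True


# The two addresses quoted in the log: the mistyped one and its correction.
for candidate in ("10.64.32.267", "10.64.32.227"):
    print(candidate, is_valid_ipv4(candidate))
```

A check like this only confirms that the address is well-formed. Whether the address is actually free still has to be verified separately, as was done in the channel.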
[13:41:54] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:42:00] \o( [13:42:08] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:42:22] (03PS2) 10Cmjohnson: Changing production dns for db1061, moving to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350837 [13:42:31] (03CR) 10Elukey: [C: 032] "Version looks consistent with https://gerrit.wikimedia.org/r/#/c/350503" [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [13:42:37] (03PS2) 10Filippo Giunchedi: Scap: update version to 3.5.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/350757 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [13:42:40] (03CR) 10Jcrespo: [C: 031] Changing production dns for db1061, moving to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350837 (owner: 10Cmjohnson) [13:43:00] (03CR) 10Cmjohnson: [C: 032] Changing production dns for db1061, moving to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350837 (owner: 10Cmjohnson) [13:43:13] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:43:24] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db1061 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350836 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [13:44:21] !log T163936: forcing puppet run on restbase1007 [13:44:29] 06Operations, 15User-fgiunchedi: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673#3220772 (10fgiunchedi) @Cmjohnson not ATM, initially I thought it was a HW raid 
config issue but doesn't look like it, thanks! [13:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:30] T163936: Latency metrics missing - https://phabricator.wikimedia.org/T163936 [13:44:58] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Change db1061 IP - T163895 (duration: 01m 19s) [13:45:00] RECOVERY - confd service on bast1001 is OK: OK - confd is active [13:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:06] T163895: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895 [13:45:45] (03PS1) 10Giuseppe Lavagetto: profile::scap::dsh: define, use everywhere instead of scap::dsh [puppet] - 10https://gerrit.wikimedia.org/r/350842 [13:45:53] 06Operations, 06Release-Engineering-Team, 05Goal, 06Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3220775 (10MoritzMuehlenhoff) I don't think we need a trusty image. The remaining services run on trusty are in the process of migra... [13:45:58] urandom: I'll take a look on graphite1003 just in case but I don't think anything will happen [13:46:02] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3220776 (10Dzahn) >>! In T147905#3220110, @akosiaris wrote: > Heh, I was hoping T162850 would have solved it. It's a bit concerning that a `R420` (it is indeed a R420) has possibly exhibited the same symptoms... 
[13:46:03] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Change db1061 IP - T163895 (duration: 01m 00s) [13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] (03PS1) 10Muehlenhoff: Build a stretch image [puppet] - 10https://gerrit.wikimedia.org/r/350843 (https://phabricator.wikimedia.org/T162042) [13:46:19] !log T163936: restarting cassandra-metrics-collector on restbase1007 [13:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:54] godog: i guess the big thing would be to make sure that none of those Table metrics are being collected [13:47:02] godog: which i can easily check for too [13:47:26] urandom: ack, thanks! yeah I'm not seeing any new metric created, LGTM [13:48:09] (03PS3) 10Filippo Giunchedi: Scap: update version to 3.5.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/350757 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [13:49:09] marosgegui/jynus powering up now [13:49:13] awesome! [13:49:22] db1062 is ready to be powered off if you want [13:49:23] (03CR) 10Filippo Giunchedi: [C: 032] Scap: update version to 3.5.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/350757 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [13:49:57] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3220784 (10akosiaris) Thanks, I 've updated my comment. [13:50:06] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::scap::dsh: define, use everywhere instead of scap::dsh [puppet] - 10https://gerrit.wikimedia.org/r/350842 (owner: 10Giuseppe Lavagetto) [13:50:34] godog: i wonder if we should consider rewriting in cmcd [13:50:45] (03PS2) 10Giuseppe Lavagetto: profile::scap::dsh: define, use everywhere instead of scap::dsh [puppet] - 10https://gerrit.wikimedia.org/r/350842 [13:50:46] !log varnish 4.1.6-1wm1 uploaded to apt.w.o [13:50:51] marostegui db1062...will d4 work? 
[13:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:54] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::scap::dsh: define, use everywhere instead of scap::dsh [puppet] - 10https://gerrit.wikimedia.org/r/350842 (owner: 10Giuseppe Lavagetto) [13:50:55] d5 is full [13:50:58] cmjohnson1: checking [13:51:03] so that in this case, you could s/metrics\.Table/metrics.ColumnFamily/ [13:51:13] could also do d8 [13:51:16] cmjohnson1: yep, that works, d4 [13:51:18] guess that only matters if we upgrade [13:51:22] great....moving there [13:51:30] db1062? [13:51:32] let me power off then [13:51:55] should be off [13:52:28] urandom: mhh yeah in the upgrade case it'd be handy to avoid having to rename the metrics (and the dashboards) [13:52:44] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895#3220785 (10Marostegui) [13:52:53] (03PS1) 10Marostegui: db-eqiad.php: Change db1062 location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350844 (https://phabricator.wikimedia.org/T163895) [13:53:18] godog: also, for new cluster, you could use it to make a clean break, and use friendlier names [13:53:26] clusers, that is [13:53:40] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [13:53:41] drop the org.apache.cassandra.metrics from all of them [13:53:45] for example [13:53:54] (03PS1) 10Volans: Puppet: fix compiler facts import [puppet] - 10https://gerrit.wikimedia.org/r/350845 (https://phabricator.wikimedia.org/T157052) [13:54:40] (03CR) 10Volans: Puppet: fix compiler facts import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350845 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [13:55:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "looks fine, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [13:55:35] !log $ readlink /usr/local/lib/cassandra-metrics-collector/cassandra-metrics-collector.jar [13:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:45] sheesh [13:55:48] paste fail. [13:56:01] cmjohnson1: was the dns change deployed for db1061? [13:56:08] yes [13:56:14] !log T163936: restarting cassandra-metrics-collector, restbase production [13:56:19] ok, let's wait a bit longer so all the hosts see it [13:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:23] T163936: Latency metrics missing - https://phabricator.wikimedia.org/T163936 [13:56:25] the host is up btw [13:56:39] marostegui, maybe it still has the old ip? 
[13:56:44] no no [13:56:46] the host is fine [13:56:48] ah [13:56:52] and the new ip replies [13:56:55] ok [13:56:58] cmjohnson@baham:~$ host db1061.eqiad.wmnet [13:56:59] db1061.eqiad.wmnet has address 10.64.32.227 [13:57:04] so it is dns [13:57:10] root@neodymium:/home/marostegui/git/software/dbtools# host db1061 [13:57:10] db1061.eqiad.wmnet has address 10.64.48.14 [13:57:11] yep :) [13:57:33] puppet [13:58:15] I am ready for db1062 whenever you are [13:58:23] cmjohnson1: it is off :) [13:58:32] (03PS2) 10Volans: Puppet: fix compiler facts import [puppet] - 10https://gerrit.wikimedia.org/r/350845 (https://phabricator.wikimedia.org/T157052) [13:58:48] k [13:59:43] !log installing ghostscript security updates [13:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:01] the dns is getting changed, some slaves start seeing the new ip sometimes :) [14:00:13] (03PS1) 10Addshore: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 [14:00:36] (03CR) 10Volans: [C: 032] Puppet: fix compiler facts import [puppet] - 10https://gerrit.wikimedia.org/r/350845 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [14:01:00] (03CR) 10Addshore: [C: 04-2] "TODO:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [14:01:42] (03CR) 10Addshore: [C: 04-2] "Scheduled for the 9th May" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [14:01:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Change db1062 location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350844 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:02:40] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:02:46] (03Merged) 10jenkins-bot: db-eqiad.php: Change db1062 location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350844 
(https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:02:51] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:02:51] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:02:51] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:02:56] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3220814 (10akosiaris) No, that rack schema wont work. What we want to do is have 4 boxes per rack row. I am already emptying ganeti2005, ganeti2006. Those 2, alongside ganeti2007... [14:02:58] (03CR) 10jenkins-bot: db-eqiad.php: Change db1062 location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350844 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:03:07] marostegui: db1062 is powering up [14:03:11] awesome! [14:04:00] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Change db1062 rack location - T163895 (duration: 00m 52s) [14:04:00] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [14:04:00] (03PS1) 10Addshore: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 [14:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:08] T163895: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895 [14:04:41] !log Stop and shutdown db1063 - T163895 [14:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:06:53] db1062 is now up [14:07:01] is db1061? [14:07:17] it is up but dns still not changed across the fleet [14:07:23] I'll check the puppet failures for scap, though it should recover at the next puppet run [14:07:26] it is getting there :) [14:07:50] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:07:50] PROBLEM - MariaDB Slave Lag: s3 on db1035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 119814.59 seconds [14:07:50] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:07:58] I just it wanted changed for the bastion [14:08:00] PROBLEM - MariaDB Slave Lag: s3 on db1044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 119658.69 seconds [14:08:00] PROBLEM - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 112730.43 seconds [14:08:11] ? 
[14:08:20] oh, I will look at those downtimes finishing [14:08:20] PROBLEM - MariaDB Slave Lag: s3 on db1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 105588.15 seconds [14:08:25] (03CR) 10Addshore: [C: 04-2] "TODO attach ticket" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 (owner: 10Addshore) [14:08:28] ignore the s3 alerts [14:08:30] ok [14:08:31] thanks [14:08:34] cmjohnson1, marostegui I don't see the removal of the old record of db1061 [14:08:35] the downtime expired [14:08:37] in 4a80a64213f4f9b1d796a2030b56ce50f474eb61 [14:08:40] took more than expected [14:08:42] +227 1H IN PTR db1061.eqiad.wmnet. [14:08:47] 3 days and counting [14:09:00] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:09:14] volans: https://phabricator.wikimedia.org/P5344 [14:09:15] the reverse was replaced but the record was added, not replaced [14:09:48] (03CR) 10Phuedx: Reading Web Page Previews alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [14:09:54] yes because we have 2 in the config [14:10:00] volans: true [14:10:01] just saw it [14:10:15] if I put 350845/3, will it work? :D templates/10.in-addr.arpa:227 1H IN PTR db1061.eqiad.wmnet. [14:10:19] templates/10.in-addr.arpa:14 1H IN PTR db1061.eqiad.wmnet.
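The two PTR lines pasted above show the leftover record: the new entry was added without removing the old one. A minimal sketch of how such leftovers could be flagged, assuming records in the `<label> <ttl> IN PTR <hostname>` shape seen in the paste; the function name `duplicate_ptr_targets` is hypothetical, and the sketch ignores `$ORIGIN` directives and anything else a real zone template contains:

```python
from collections import defaultdict


def duplicate_ptr_targets(lines):
    """Map each PTR target hostname to its record labels, keeping only
    hostnames that more than one record points at."""
    seen = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Expect "<label> <ttl> IN PTR <hostname>", as in the pasted lines.
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "PTR":
            seen[fields[4]].append(fields[0])
    return {host: labels for host, labels in seen.items() if len(labels) > 1}


# The two records pasted in the channel, without the file-name prefix.
zone_lines = [
    "227 1H IN PTR db1061.eqiad.wmnet.",
    "14 1H IN PTR db1061.eqiad.wmnet.",
]
print(duplicate_ptr_targets(zone_lines))
```

A hostname with several PTR records is not always an error, so the result is best read as a list of candidates for review, much like the `git grep db1061.eqiad.wmnet` suggested in the channel.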
[14:10:23] I have added downtime until tuesday [14:10:25] * volans pastefail [14:10:37] <_joe_> win 19 [14:10:42] (03PS9) 10Giuseppe Lavagetto: scap::source: also define the corresponding dsh group [puppet] - 10https://gerrit.wikimedia.org/r/306431 [14:10:43] (03PS1) 10Giuseppe Lavagetto: scap::dsh: drop 'recurse => true' on directory [puppet] - 10https://gerrit.wikimedia.org/r/350849 [14:10:50] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:10:51] volans: no, that was me not removing it [14:10:58] cmjohnson1: I am sending the patch [14:11:00] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:11:13] cmjohnson1: git grep db1061.eqiad.wmnet [14:11:25] I have 2 entries [14:11:40] yeah..no, I realize what went wrong...I forgot to remove it when I added the new one [14:13:46] (03PS1) 10Marostegui: Remove db1061 old IP [dns] - 10https://gerrit.wikimedia.org/r/350850 (https://phabricator.wikimedia.org/T163895) [14:13:47] cmjohnson1 volans jynus ^ [14:14:13] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh: drop 'recurse => true' on directory [puppet] - 10https://gerrit.wikimedia.org/r/350849 (owner: 10Giuseppe Lavagetto) [14:14:19] (03PS2) 10Giuseppe Lavagetto: scap::dsh: drop 'recurse => true' on directory [puppet] - 10https://gerrit.wikimedia.org/r/350849 [14:14:21] I don't know which one of the 2 is the right one, but yes, go ahead [14:14:24] (03CR) 10Jcrespo: [C: 031] Remove db1061 old IP [dns] - 10https://gerrit.wikimedia.org/r/350850 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:14:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] scap::dsh: drop 'recurse => true' on directory [puppet] - 10https://gerrit.wikimedia.org/r/350849 (owner: 10Giuseppe Lavagetto) [14:14:40] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [14:14:40] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [14:14:50] <_joe_> godog: ^^ [14:14:53] (03CR) 10Marostegui: [C: 032] Remove db1061 old IP [dns] - 10https://gerrit.wikimedia.org/r/350850 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:16:52] db1061 is getting slaves connected to it so it is all good [14:17:03] cmjohnson1: when you have an IP for db1063 let me know :) [14:17:36] 32.228 [14:17:40] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:17:50] great [14:18:05] _joe_: yeah not sure why those sporadic failures, running puppet again does the right thing [14:18:30] <_joe_> meh [14:18:37] <_joe_> isn't scap a deb package? [14:18:43] (03PS1) 10Cmjohnson: adding dns entry for db1063 relocating to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350851 [14:18:43] it is [14:18:52] godog: this might be been indirectly triggered by the ghostscript update? [14:19:09] at least that was during the time when the ghostscript upgrade ran [14:19:15] cmjohnson1: server is off [14:19:16] moritzm: ah, possibly if puppet was running but not via cron then yeah, 'apt update' wouldn't have been run [14:19:19] k [14:19:27] that'd explain it [14:19:48] likewise for the puppet failures at 16:02 in eqiad [14:20:00] (03PS10) 10Giuseppe Lavagetto: scap::source: also define the corresponding dsh group [puppet] - 10https://gerrit.wikimedia.org/r/306431 [14:20:02] yeah that's likely it then, race [14:20:02] (03CR) 10Cmjohnson: [C: 032] adding dns entry for db1063 relocating to row C T163895 [dns] - 10https://gerrit.wikimedia.org/r/350851 (owner: 10Cmjohnson) [14:20:17] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6263/ this is a noop at the moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/306431 (owner: 10Giuseppe Lavagetto) [14:21:10] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db1063 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350853 (https://phabricator.wikimedia.org/T163895) [14:24:51] Hmm, i get this puppet error now [14:24:51] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item scap::dsh::groups in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/scap/dsh.pp:5 on node phab-tin.phabricator.eqiad.wmflabs [14:24:51] Warning: Not using cache on failed catalog [14:24:51] Error: Could not retrieve catalog; skipping run [14:26:29] marostegui db1063 is powering up [14:26:38] ook great! [14:26:51] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3220916 (10akosiaris) I 've fully removed ganeti2005, ganeti2006 from the cluster and downtimed them in icinga for 2 months, then turned them off. Feel free to unrack them whenever... 
[14:27:07] <_joe_> paladox: define that in hiera [14:27:14] <_joe_> paladox: scap::dsh::groups: {} [14:27:19] ah thanks [14:27:30] <_joe_> sorry, I checked for deployment-tin of course [14:27:40] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::source: also define the corresponding dsh group [puppet] - 10https://gerrit.wikimedia.org/r/306431 (owner: 10Giuseppe Lavagetto) [14:27:47] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db1063 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350853 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:28:32] _joe_ i now get this error https://phabricator.wikimedia.org/P5346 [14:29:01] <_joe_> paladox: that's why I said to define it as {} [14:29:07] I did [14:29:17] it converted it into [ ] [14:29:18] https://wikitech.wikimedia.org/wiki/Hiera:Phabricator [14:29:45] <_joe_> ok so that's not a problem I can solve for you, sorry :/ [14:29:54] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db1063 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350853 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:30:03] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:30:08] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db1063 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350853 (https://phabricator.wikimedia.org/T163895) (owner: 10Marostegui) [14:30:14] <_joe_> that's a problem of the wikitech hiera interface [14:30:55] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3220924 (10Ottomata) Not difficult at all. I think this server is not used often, only really when there are issues with dbstore1002. @cmjohnson, let me know what d... 
[14:31:03] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:31:13] <_joe_> paladox: open a bug to the labs team [14:31:18] <_joe_> sorry, cloud services team [14:31:21] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Change db1063 IP and rack - T163895 (duration: 00m 50s) [14:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:29] T163895: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895 [14:32:15] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Change db1063 IP and rack - T163895 (duration: 00m 48s) [14:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:46] cmjohnson1: I can reach db1063 now. Will do all the checks and close the ticket. Thanks a lot for your help with this last minute request, really appreciate it [14:34:32] (03CR) 10Ottomata: [C: 031] "I'd consider piping the 5xx logs into Kafka from kafkatee, rather than sending directly to logstash. 
Then you have a reliable buffer, tha" [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [14:34:49] hi [14:35:19] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3220951 (10Ottomata) @Cmjohnson yes, it is already 'off' from our point of view :) [14:38:06] (03PS2) 10Phuedx: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 [14:38:10] (03CR) 10Phuedx: Reading Web Page Previews alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [14:38:47] (03PS3) 10Phuedx: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 [14:39:32] _joe_ i resolved it by setting the default for groups here https://github.com/wikimedia/puppet/commit/9b15276d969f270786cdd4c7042dfc222c111a1c to {} [14:39:48] can we do that? Since there are defaults for other configs in that file? [14:42:53] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:43:27] <_joe_> paladox: no, because that's the only config that needs to be defined [14:43:30] <_joe_> even if empty [14:43:40] Oh, it fixes it for me if i set {} there. [14:43:54] <_joe_> I won't change the correct logic of the code because there is a bug in the hiera interface in mediawiki [14:43:59] <_joe_> sorry, wikitech [14:44:12] <_joe_> paladox: you have your own puppetmaster? [14:44:18] yep [14:44:48] <_joe_> so, just go to /etc/puppet/hieradata/labs/$your_project/common.yaml [14:44:50] 06Operations, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3220982 (10fgiunchedi) I started doing some analytics with hive on `webrequest` data for upload, reporting the queries here for reference. Note that running a query ove... 
[14:44:55] <_joe_> and define the variable there [14:45:03] <_joe_> removing it from wikitech [14:45:05] <_joe_> it will work [14:45:17] ok [14:45:27] I can do it through horizion [14:45:51] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move masters away from D1 in eqiad - https://phabricator.wikimedia.org/T163895#3220983 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This has all been completed. The masters have now slaves connected to them again: ``` root@neodymium:/home/maro... [14:45:51] <_joe_> paladox: that too :) [14:46:02] yep :) [14:47:14] works :) [14:48:07] 06Operations, 06Analytics-Kanban, 10DBA: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3220992 (10elukey) [14:49:16] 06Operations, 06Analytics-Kanban, 10DBA: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3220845 (10elukey) @akosiaris: After a chat with Jaime I'd like to explore the possibility of using bacula, but I was told to double check with you requirements. Do yo... [14:51:30] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3221001 (10Cmjohnson) Failed to get through post, fails at initializing idrac that eventually times out and tries again. Most likely a system board replacement... [14:52:18] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3221004 (10elukey) analytics1003 was done, so to complete the work we'd need to reimage: * stat100[23] * analytics1030 (down for maintenance) All t... 
[14:52:35] (03PS2) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [14:53:20] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3221010 (10Ottomata) We dont' need to reimage stat100[23]. They should be decommed this quarter. [14:53:30] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3221011 (10Cmjohnson) Also, receive a message that bbu is discharged [14:55:44] 06Operations, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3221023 (10fgiunchedi) And a rough estimation of the long tail, note that ~60% of sizes have been requested less than 1000 times in april. Only 4% of sizes are requeste... [14:55:53] !log shutting down elastic2020 for mainboard replacement - T149006 [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [14:56:46] (03CR) 10Muehlenhoff: [C: 031] Remove mgmt dns records for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [15:00:44] (03CR) 10Dzahn: [C: 032] releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [15:00:50] (03PS8) 10Dzahn: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [15:01:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think this is a good idea but given we're taking this path, I'd add some intelligence to the define. 
See the couple of comments about th" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:02:51] (03CR) 10Dzahn: [C: 032] "that's it, precise. today was officially your last day of support." [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [15:03:18] (03CR) 10Giuseppe Lavagetto: [C: 031] Swap mc1001->mc1012 with mc1019->mc2030 [puppet] - 10https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [15:07:45] (03PS19) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [15:13:24] 06Operations, 06Analytics-Kanban, 10DBA: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3221132 (10akosiaris) Depends on how often you want it backed up and the rate of growth. So mysql needs to be dumped in some way before it is backed up as backing up... [15:15:07] 06Operations, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3221138 (10fgiunchedi) Some more frequency distributions of size vs number of requests using bitly's data hacks {P5347} [15:17:05] cmjohnson1: I'm good to do prometheus mem upgrade btw if you have time [15:18:13] godog: sure give me 5 mins [15:18:44] godog: if you need power anything off go ahead and do it now [15:19:16] cmjohnson1: kk, we'll need to power off one machine at a time, I'll start with prometheus1003 [15:21:53] !log poweroff prometheus1003 for ram upgrade - T163385 [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:02] T163385: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385 [15:22:49] cmjohnson1: prometheus1003 is powered off now [15:23:11] ok [15:23:48] 06Operations, 06Analytics-Kanban, 10DBA: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3220845 
(10jcrespo) I agree with most of things said, and I actually mentioned some of those to luka on IRC. BTW, for the record- the best way to move forward regardi... [15:29:03] godog robh the numbers do not add up...there is only 1 (16B) stick in prometheus1003 not 32. The most I can add to one box is 64 and then 96 in the other. We'll need to order 2 more sticks. [15:29:15] Trizek: if people report that a language has the 'wrong language', then usually that means something went wrong when rebuilding the translations after a deploy [15:29:36] cmjohnson1: that is odd [15:29:40] we orderd those machiens iwth 32 [15:29:46] so it arrived wrong and was never caught? [15:29:55] thedj: I'm lost. Do you have any context? [15:30:02] godog: if this setup works for you now which do you prefer gets the 96 [15:30:12] "But why is Finished used as a translation backup for British English?" [15:30:20] cmjohnson1: sigh, yeah that works for me, does it for you robh ? [15:30:25] cmjohnson1: If prometheus1003 doesn't have 32, then we fucked up [15:30:26] https://phabricator.wikimedia.org/T149339 [15:30:36] no, we ordered a machine and it arrived with less ram than ordered [15:30:41] so this is going to result in a large ordeal. [15:30:44] (with me, and the vendors) [15:30:52] unfortunately, we didnt catch it until now? [15:31:01] I see 32gb on prometheus1004 btw [15:31:20] https://phabricator.wikimedia.org/T149339 is the purchase of 2 prometheus machines for each site [15:31:25] with 32GB [15:31:32] ditto prometheus1003 shows up with 32gb from the graphs [15:31:35] then we're one stick short [15:31:48] it showed 32 in software.... you sure? 
[15:32:05] robh: yep https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=prometheus1003&var-datasource=eqiad%20prometheus%2Fops [15:33:17] okay, thanks thedj [15:33:36] godog: yeah i was agreeing with you and questioning what chris found =] [15:33:45] ah ok, misread [15:33:45] cmjohnson1: perhaps they shipped with 32GB dimm, please investigate [15:33:58] it would also be the first time dell ever sent us a system short of memory [15:34:23] cmjohnson1: if it isnt detecting all the memory after adding some, i'll manually check memory total by counting them [15:34:35] and compariing labels, cuz yeah, all softare before showed 32GB on 1003 [15:35:02] cmjohnson1: not trying to sound condecending or anthing! [15:35:09] just seems odd that it doesnt match [15:36:04] godog robh nevermind...it's there [15:36:57] forget all about this conversation....it never happened [15:37:18] im just happy i dont have to call dell. [15:38:20] hahah ok, even better this way [15:40:31] !log deploying new events_coredb_slave.sql on codfw T160984 [15:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:40] T160984: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984 [15:45:12] (03PS1) 10Alexandros Kosiaris: librenms: Introduce scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/350861 (https://phabricator.wikimedia.org/T129136) [15:47:24] godog booting up [15:47:48] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3221263 (10Dzahn) mainboard is being replaced right now [15:48:37] s1 and s2 look good, going for the other shards [15:51:43] cmjohnson1: nice, 1003 is back with 96gb [15:54:13] godog: great lmk when you're ready for the next one [15:55:50] (03CR) 10Dzahn: [C: 04-1] "compiler says "Error: Invalid 
parameter ipv4 on Interface::Ip[phabricator vcs]" http://puppet-compiler.wmflabs.org/6264/iridium.eqiad.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis) [15:56:26] cmjohnson1: kk, powering off 1004 now [15:56:56] !log poweroff prometheus1004 for ram upgrade - T163385 [15:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] T163385: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385 [15:57:15] cmjohnson1: should be off shortly [15:59:06] (03PS1) 10Andrew Bogott: Nova api policy: Open up os_compute_api:os-floating-ips [puppet] - 10https://gerrit.wikimedia.org/r/350868 (https://phabricator.wikimedia.org/T164085) [15:59:13] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3221288 (10Cmjohnson) [15:59:14] 06Operations, 10ops-eqiad, 10Phabricator: phab1001 hdd port a failure - https://phabricator.wikimedia.org/T163960#3221286 (10Cmjohnson) 05Open>03Resolved Received and swapped the disk in port A Return tracking number is USPS 9202394653012435442374 FEDEX 9611918 2393026 72192398 [16:00:27] (03CR) 10Dzahn: [C: 04-1] "should use the new "interface::alias" instead of "interface::ip" and needs https://gerrit.wikimedia.org/r/#/c/350773/ to be merged first. " [puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis) [16:04:11] godog: powering up now [16:07:27] cmjohnson1: yep we're back, checking, I'll resolve the task once done, thanks! 
[16:08:24] Hey, there's an error with Special:RevisionReview [16:08:32] yah [16:08:39] great..thx [16:09:08] works in pc with regular rollback but nothing else [16:09:38] not reproducible anymore because Chrissymad just rollbacked it :P [16:09:46] sorry :P [16:10:00] error code: Request from *insert ip here* via cp2016 cp2016, Varnish XID 482097163 [16:10:00] [18:07:56] Error: 503, Service Unavailable at Fri, 28 Apr 2017 16:07:37 GMT [16:10:38] Request from *ip* via cp2010 cp2010, Varnish XID 269244643 [16:10:38] Error: 503, Service Unavailable at Fri, 28 Apr 2017 16:05:10 GMT also [16:11:24] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3218244 (10Dzahn) We have lots of room in A2 and A4 and we can move into A4, but we can't move into A2 because there is a 10G switch and the server just has 1G nic cards (would nee... [16:12:08] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3221343 (10mmodell) @Marostegui: correct, phabricator isn't currently querying the slave, other than the reports mentioned by @jcrespo. [16:12:13] 06Operations, 06Release-Engineering-Team, 05Goal, 13Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3150656 (10mobrovac) >>! In T162042#3220775, @MoritzMuehlenhoff wrote: > I don't think we need a trusty image. The remaining services ru... [16:12:44] Hello anyone know if the issues from yesterday are back? Some users are reporting issues with pending changes on en? [16:13:03] 06Operations, 10ops-eqiad, 15User-fgiunchedi: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385#3221352 (10fgiunchedi) 05Open>03Resolved Both machines upgraded and back with 96GB, thanks @Cmjohnson ! 
[16:14:28] (03CR) 10Andrew Bogott: [C: 032] Nova api policy: Open up os_compute_api:os-floating-ips [puppet] - 10https://gerrit.wikimedia.org/r/350868 (https://phabricator.wikimedia.org/T164085) (owner: 10Andrew Bogott) [16:14:34] (03PS2) 10Andrew Bogott: Nova api policy: Open up os_compute_api:os-floating-ips [puppet] - 10https://gerrit.wikimedia.org/r/350868 (https://phabricator.wikimedia.org/T164085) [16:15:12] works now, no idea what happened [16:17:09] Strange thanks DatGuy [16:17:28] 06Operations, 10ops-eqiad, 10DBA: db2062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221380 (10jcrespo) [16:20:51] (03PS3) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [16:22:32] 06Operations, 10ops-eqiad, 10DBA: db2062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221420 (10jcrespo) Moving eqiad master service back to db1041. [16:24:50] 06Operations, 10ops-eqiad, 10DBA: db2062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221433 (10jcrespo) [18:18:16] jynus: raid battery [18:23:13] jynus: i have a spare [18:23:19] swapping it now [16:28:33] _joe_: corrections done on the logrotate change ^ [16:30:20] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3221480 (10Cmjohnson) 05Open>03Resolved @dzahn, the ipmi works on the host but is not reachable neodymium. This is something that is intermittent and not necessarily related to this task. There is an open ipm... [16:30:46] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3221482 (10Dzahn) ganeti2005 has been moved to A4 @ 16. 
switch: asw-a4-codfw: port ge-4/0/16 please configure switch [16:31:51] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3221486 (10Cmjohnson) @ottomata Let's schedule for Wednesday next week @10am EST. [16:32:25] 06Operations, 10ops-eqiad, 10DBA: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221491 (10Cmjohnson) [16:33:15] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3221494 (10Cmjohnson) [16:33:16] no errors now on start [16:33:27] I would like to restart it once more, however [16:33:30] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215445 (10Cmjohnson) [16:34:38] 06Operations, 10ops-eqiad, 10DBA: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221499 (10jcrespo) No errors on the last boot, but I would like to confirm by restarting it once more. I am doing that. 
[16:36:11] !log restarting db1062 once more T164092 [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:20] T164092: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092 [16:38:10] !log stopping replication on all nodes on s7-eqiad in case db1062 boots up in a corrupted state [16:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:44] 06Operations, 10ops-codfw, 10hardware-requests: decommision nembus - https://phabricator.wikimedia.org/T162928#3221515 (10Papaul) [16:42:46] 06Operations, 10ops-codfw, 10hardware-requests: decommision nembus - https://phabricator.wikimedia.org/T162928#3179655 (10Papaul) [16:42:53] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3221522 (10Marostegui) Great, we can change the DNS and that's should be it! Thanks! [16:50:41] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3221525 (10Papaul) Mainboard replacement complete. [16:51:27] (03PS1) 10Papaul: DHCP: Change MAC address for elastic2020:mainboard replaced [puppet] - 10https://gerrit.wikimedia.org/r/350882 [16:51:30] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3221526 (10Dzahn) ganeti2006 has been moved to A4 @ 17. 
switch: asw-a4-codfw: port ge-4/0/17 please configure switch [16:58:40] 06Operations, 10ops-codfw, 10hardware-requests: decommision nembus - https://phabricator.wikimedia.org/T162928#3179655 (10Dzahn) removed from rack [16:59:11] (03CR) 10Dzahn: [C: 032] DHCP: Change MAC address for elastic2020:mainboard replaced [puppet] - 10https://gerrit.wikimedia.org/r/350882 (owner: 10Papaul) [17:04:29] (03PS1) 10Dzahn: remove mgmt DNS for nembus (decom'ed) [dns] - 10https://gerrit.wikimedia.org/r/350883 (https://phabricator.wikimedia.org/T162928) [17:08:59] !log restarting replication on all nodes on s7-eqiad T164092 [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:09] T164092: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092 [17:09:41] Do we know what caused ~1 million jobs to be enqueued in the last hour? [17:09:46] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-2d&to=now [17:12:18] 06Operations, 10ops-eqiad, 10DBA: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221587 (10Marostegui) >>! In T164092#3221420, @jcrespo wrote: > Moving eqiad master service back to db1041. This might be confusing, should we specify that it was never done? [17:13:16] 06Operations, 10ops-eqiad, 10DBA: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3221589 (10jcrespo) You already did- I was doing it when Chris asked me to wait on IRC. 
[17:21:50] (03CR) 10Dzahn: [C: 032] remove mgmt DNS for nembus (decom'ed) [dns] - 10https://gerrit.wikimedia.org/r/350883 (https://phabricator.wikimedia.org/T162928) (owner: 10Dzahn) [17:21:55] (03PS2) 10Dzahn: remove mgmt DNS for nembus (decom'ed) [dns] - 10https://gerrit.wikimedia.org/r/350883 (https://phabricator.wikimedia.org/T162928) [17:24:18] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3221615 (10Dzahn) 2007 and 2008 should go into A5 then (still has to happen) [17:27:21] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decommision nembus - https://phabricator.wikimedia.org/T162928#3221623 (10Dzahn) @Robh please remove switch config (asw-b-codfw:ge-5/0/12) [17:27:37] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decommision nembus - https://phabricator.wikimedia.org/T162928#3221624 (10Dzahn) [17:28:19] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decommision nembus - https://phabricator.wikimedia.org/T162928#3179655 (10Dzahn) a:05Papaul>03RobH [17:28:23] (03PS2) 10Muehlenhoff: Fix compatability with current bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350835 [17:30:53] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3221629 (10Dzahn) @cmjohnson Yes, that is what @volans asked about on T161158#3219587. Can you apply the fix on T150160 please? 
[17:33:59] (03CR) 10Muehlenhoff: [C: 032] Fix compatability with current bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350835 (owner: 10Muehlenhoff) [17:39:36] RECOVERY - mysqld processes on db1040 is OK: PROCS OK: 1 process with command name mysqld [17:40:11] RECOVERY - MariaDB Slave IO: s4 on db1040 is OK: OK slave_io_state not a slave [17:40:20] RECOVERY - MariaDB Slave SQL: s4 on db1040 is OK: OK slave_sql_state not a slave [17:40:20] RECOVERY - MariaDB Slave Lag: s4 on db1040 is OK: OK slave_sql_lag not a slave [17:44:03] (03PS1) 10Rush: tools: change python-boostrap-vz to use bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350885 (https://phabricator.wikimedia.org/T157526) [17:45:44] (03CR) 10Muehlenhoff: [C: 031] tools: change python-boostrap-vz to use bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350885 (https://phabricator.wikimedia.org/T157526) (owner: 10Rush) [17:47:23] (03CR) 10Rush: [C: 032] tools: change python-boostrap-vz to use bootstrap-vz [puppet] - 10https://gerrit.wikimedia.org/r/350885 (https://phabricator.wikimedia.org/T157526) (owner: 10Rush) [18:03:49] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196736 (10jcrespo) There seems to be an increase of 1 million items in the last 12 hours, or what it is... [18:14:02] !log T163936: disabling puppet on restbase-dev1001 (t-shooting c-m-c) [18:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:11] T163936: Latency metrics missing - https://phabricator.wikimedia.org/T163936 [18:18:00] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:20:27] (03PS1) 10Joal: Update cron job copying mediawiki db into hdfs [puppet] - 10https://gerrit.wikimedia.org/r/350888 (https://phabricator.wikimedia.org/T163483) [18:24:55] (03PS7) 10Ejegg: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [18:27:55] (03CR) 10Rush: [C: 031] "andrew offered to shepherd this out later today :)" [puppet] - 10https://gerrit.wikimedia.org/r/350774 (owner: 10Faidon Liambotis) [18:32:03] (03PS2) 10Rush: Tools: Require gridengine-master for gridengine_resource [puppet] - 10https://gerrit.wikimedia.org/r/339921 (https://phabricator.wikimedia.org/T127388) (owner: 10Tim Landscheidt) [18:41:08] (03CR) 10Volans: "Thanks but this depends on a previous changes that needs to be merged. We'll ping you when ready ;)" [puppet] - 10https://gerrit.wikimedia.org/r/350774 (owner: 10Faidon Liambotis) [18:45:00] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [18:52:08] godog: you're not still around by any chance, are you? [18:53:14] (03CR) 10Jforrester: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [18:54:20] (03CR) 10Jforrester: "> Should this patch also update static/images/wikimedia-button.png , which is in the footer of a lot of projects?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [19:02:53] (03CR) 10Jforrester: [C: 04-1] "It's not clear why PS7 removed the changes to static/images/project-logos/notifications/120px-Wikimedia-logo.svg.png and static/images/pro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [19:08:59] !log T163936: reenabling puppet on restbase-dev1001 [19:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] T163936: Latency metrics missing - https://phabricator.wikimedia.org/T163936 [19:13:59] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3221912 (10jcrespo) [19:15:32] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3221926 (10jcrespo) [19:16:23] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3221912 (10jcrespo) ``` SET GLOBAL innodb_flush_log_at_trx_commit=0; SET GLOBAL sync_binlog=0; ``` Seems to be helping. I had tried disabling semi_sync replication, but that didn't work. [19:20:09] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3221935 (10jcrespo) Oh, I got it: ``` Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Ba... 
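The symptom jcrespo reports on T164107 (default cache policy WriteBack, current policy WriteThrough) can be checked mechanically. What follows is only a sketch, assuming the `Default Cache Policy:` and `Current Cache Policy:` line format quoted above; the function names are hypothetical, and the sample text is shortened for illustration rather than copied from a real controller.

```python
def cache_policies(ldinfo_text):
    """Return {'Default': [...], 'Current': [...]} parsed from LDInfo-style text."""
    policies = {}
    for line in ldinfo_text.splitlines():
        # Split "Default Cache Policy: WriteBack, ..." into a name and a value.
        name, sep, value = line.partition(" Cache Policy:")
        name = name.strip()
        if sep and name in ("Default", "Current"):
            policies[name] = [field.strip() for field in value.split(",")]
    return policies


def write_mode_degraded(ldinfo_text):
    """True when the default write mode is WriteBack but the current one is not."""
    policies = cache_policies(ldinfo_text)
    if "Default" not in policies or "Current" not in policies:
        # Not enough information to decide; report nothing rather than guess.
        return False
    return policies["Default"][0] == "WriteBack" and policies["Current"][0] != "WriteBack"


# Illustrative sample only: trailing policy fields are intentionally left out.
SAMPLE = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct
Current Cache Policy: WriteThrough, ReadAheadNone, Direct
"""

if __name__ == "__main__":
    print(write_mode_degraded(SAMPLE))
```

A mismatch like this is what pointed the ticket toward the BBU and controller temperature rather than toward replication settings.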
[19:20:16] (03PS1) 10RobH: potential ssh key compromise for ezachte [puppet] - 10https://gerrit.wikimedia.org/r/350893 [19:20:33] (03CR) 10RobH: [C: 032] potential ssh key compromise for ezachte [puppet] - 10https://gerrit.wikimedia.org/r/350893 (owner: 10RobH) [19:26:42] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3221962 (10jcrespo) The only reason I can see is: ``` Temperature: 78 C Temperature : High ``` while on db1062 I see: ``` Temperature: 47 C Temperature... [19:30:00] !log shutting down db1063 - I see high temperatures reported, and going up T164107 [19:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:09] T164107: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107 [19:30:57] (03PS3) 10Andrew Bogott: Tools: Require gridengine-master for gridengine_resource [puppet] - 10https://gerrit.wikimedia.org/r/339921 (https://phabricator.wikimedia.org/T127388) (owner: 10Tim Landscheidt) [19:33:37] (03CR) 10Andrew Bogott: [C: 032] Tools: Require gridengine-master for gridengine_resource [puppet] - 10https://gerrit.wikimedia.org/r/339921 (https://phabricator.wikimedia.org/T127388) (owner: 10Tim Landscheidt) [19:50:09] whois byron [19:51:24] :P [19:51:41] fail [19:53:20] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3222087 (10jcrespo) On boot: ``` megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature Temperature: 64 C Temperature : OK ``` [19:53:32] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3222088 (10jcrespo) ``` $ cat /sys/class/thermal/thermal_zone*/temp 61000 60000 ``` [20:00:18] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3222112 (10jcrespo) This is now ok, but it is getting hotter: ``` $ megacli -LDInfo -L0 -a0 | 
grep "Cache Policy:" Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Cu... [20:00:37] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3222113 (10Krinkle) >>! In T156924#3220304, @Volans wrote: > 1. Should MW work seamlessly if... [20:03:11] 06Operations, 10DBA: db1063 io (s5 master eqiad) performance is bad - https://phabricator.wikimedia.org/T164107#3222123 (10Marostegui) labsdb1011 which is in the same rack: ``` Controller Temperature (C): 60 ``` [20:05:00] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222137 (10jcrespo) [20:10:15] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222162 (10jcrespo) I have forced: ``` megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll ``` The server will get fried, but at least we won't have lag. [20:18:19] If you see alters during my night from db1063 [20:18:36] it means it has been fried: https://phabricator.wikimedia.org/T164107 [20:20:47] I am checking the fans, jynus and they look good to be honest [20:22:13] there is high load [20:23:18] !log Live debug on mwdebug1001 for T164059 [20:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:28] T164059: MediaWiki\Linker\LinkRenderer::makeKnownLink() must implement interface MediaWiki\Linker\LinkTarget, null given on Special:Watchlist - https://phabricator.wikimedia.org/T164059 [20:25:38] Im just thinking out loud here but is it really a good idea to sacrifice a server just because it is lagging? 
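The `/sys/class/thermal/thermal_zone*/temp` readings quoted on T164107 (`61000` and `60000`) are in millidegrees Celsius, so they correspond to 61 C and 60 C. A minimal sketch for reading them, assuming a Linux host that exposes these sysfs files; the helper names are made up for illustration:

```python
import glob


def millidegrees_to_celsius(raw):
    """Convert one sysfs thermal reading (millidegrees Celsius, as text) to degrees."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: degrees Celsius} for every readable thermal zone file."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                readings[path] = millidegrees_to_celsius(handle.read())
        except (OSError, ValueError):
            # Some zones may be unreadable or report no value; skip them.
            continue
    return readings


if __name__ == "__main__":
    for path, celsius in read_thermal_zones().items():
        print(f"{path}: {celsius:.1f} C")
```

These are the CPU-side zones; the RAID controller and BBU temperatures discussed in the ticket come from `megacli`, not from sysfs.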
[20:25:51] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222187 (10Marostegui) Fans and the other sensors look fine though: ``` 12 | Fan1 RPM | Fan | 3960.00 | RPM | 'OK' 13 | Fa... [20:34:58] (03PS1) 10Krinkle: Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 [20:35:18] (03PS2) 10Krinkle: Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) [20:56:49] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222282 (10Marostegui) There is nothing on the controllers' log apart from the automatic switch to WriteThrough when it first detected the BBU temp was high:... [21:10:00] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 142523.21 seconds [21:10:17] (03CR) 10Ejegg: "JForrester, I asked comms about it in the phab ticket: https://phabricator.wikimedia.org/T144254" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [21:24:51] !log End of live debug on mwdebug1001, restored previous state with a local scap pull [21:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:00] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222417 (10jcrespo) ```lines=10 MCE 0 CPU 2 THERMAL EVENT TSC 3f67e99385dbc7 TIME 1492766490 Fri Apr 21 09:21:30 2017 Processor 2 heated above trip temperatu... 
[21:33:42] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222422 (10jcrespo) [21:43:39] (03PS1) 10Andrew Bogott: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) [21:45:13] (03PS2) 10Andrew Bogott: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) [21:45:34] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3222474 (10Marostegui) >>! In T164107#3222417, @jcrespo wrote: > ```lines=10 > MCE 0 > CPU 2 THERMAL EVENT TSC 3f67e99385dbc7 > TIME 1492766490 Fri Apr 21 09... [21:46:14] (03CR) 10jerkins-bot: [V: 04-1] Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [21:50:34] (03PS3) 10Andrew Bogott: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) [21:52:14] (03CR) 10jerkins-bot: [V: 04-1] Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [21:54:09] (03PS4) 10Andrew Bogott: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) [21:54:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [21:55:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [21:56:30] (03CR) 10Andrew Bogott: "Advice about 
how to make this check less fragile is welcome! There are an awful lot of hard-coded strings in there." [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [22:00:30] 06Operations, 10Wikimedia-Apache-configuration: https://test.wikipedia.org/wiki/Bug%3F?action=history doesn't show the history page, unlike https://test.wikipedia.org/w/index.php?title=Bug%3F&action=history - https://phabricator.wikimedia.org/T123276#3222527 (10Krinkle) [22:16:59] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3222567 (10Krinkle) [22:17:44] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS (wgSecureLogin) - https://phabricator.wikimedia.org/T156320#2970877 (10Krinkle) [22:19:04] 06Operations, 06Labs: tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#3222598 (10chasemp) [22:19:16] 06Operations, 06Labs: tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#3222613 (10chasemp) p:05Triage>03Normal a:03chasemp [22:38:06] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS (wgSecureLogin) - https://phabricator.wikimedia.org/T156320#3222669 (10Krinkle) [22:46:52] (03PS5) 10Krinkle: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [22:47:56] (03CR) 10Krinkle: "LGTM actually. I don't think there's a much better way than this right now." [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [23:18:10] PROBLEM - Disk space on graphite1003 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 63959 MB (3% inode=97%) [23:22:20] (03CR) 10Krinkle: "Few thoughts regarding 2 instances." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling)