[00:10:51] thanks Niharika (and congrats! :) ) [00:13:51] (03PS1) 10Brian Wolff: Set $wgCentralAuthOldNameAntiSpoofWiki = 'metawiki'; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440034 (https://phabricator.wikimedia.org/T196386) [00:14:29] I'm going to do a security related deploy [00:15:29] ok [00:15:47] And also congrats to Niharika :) [00:16:07] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:27] (03CR) 10Brian Wolff: [C: 032] Set $wgCentralAuthOldNameAntiSpoofWiki = 'metawiki'; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440034 (https://phabricator.wikimedia.org/T196386) (owner: 10Brian Wolff) [00:20:07] (03Merged) 10jenkins-bot: Set $wgCentralAuthOldNameAntiSpoofWiki = 'metawiki'; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440034 (https://phabricator.wikimedia.org/T196386) (owner: 10Brian Wolff) [00:20:22] (03CR) 10jenkins-bot: Set $wgCentralAuthOldNameAntiSpoofWiki = 'metawiki'; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440034 (https://phabricator.wikimedia.org/T196386) (owner: 10Brian Wolff) [00:20:48] bawolff: let me know when you finish [00:20:57] ok [00:28:27] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:33:37] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [00:35:07] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [00:37:22] !log bawolff@deploy1001 Synchronized wmf-config/CommonSettings.php: Deploy I5f25c529f5bac5c (prevent users from registering previously renamed users) (duration: 00m 59s) [00:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:07] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.8/extensions/CentralAuth/extension.json: Deploy I5f25c529f5bac5c (prevent users from registering previously renamed users) (duration: 00m 57s) [00:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:31] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.8/extensions/CentralAuth/AntiSpoof/CentralAuthAntiSpoofHooks.php: Deploy I5f25c529f5bac5c (prevent users from registering previously renamed users) (duration: 00m 57s) [00:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:41] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.7/extensions/CentralAuth/extension.json: Deploy I5f25c529f5bac5c (prevent users from registering previously renamed users) (duration: 00m 58s) [00:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:07] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.7/extensions/CentralAuth/AntiSpoof/CentralAuthAntiSpoofHooks.php: Deploy I5f25c529f5bac5c (prevent users from registering previously renamed users) (duration: 00m 57s) [00:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:01] AaronSchulz: I'm done [00:47:06] k [00:47:28] AaronSchulz: I just cherry-picked my change on to the deployment host in order to avoid pulling your change - I hope that's ok [00:48:47] bawolff: so the HEAD..origin one is just the same thing (submodule update)? 
[00:48:58] e.g. e738819eeaab225229a0b6c6c06a5b0de8017ef5 [00:49:06] yes [00:49:18] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [00:52:37] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:53:39] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php: c2df9668d13 (duration: 00m 58s) [00:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:25] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.8/tests/phpunit/includes/db/LBFactoryTest.php: (no justification provided) (duration: 00m 58s) [00:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:47] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [00:58:48] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy [01:00:32] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.7/includes/libs/rdbms/lbfactory/LBFactory.php: f83bad65fce6e6e (duration: 00m 59s) [01:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:50] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.7/tests/phpunit/includes/db/LBFactoryTest.php: (no justification provided) (duration: 00m 57s) [01:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:14] (03CR) 10Aaron Schulz: [C: 032] Add "memcached-mcrouter" to $wgObjectCaches as default for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436252 (owner: 10Aaron Schulz) [01:10:50] (03Merged) 10jenkins-bot: Add "memcached-mcrouter" to $wgObjectCaches as default for testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/436252 (owner: 10Aaron Schulz) [01:11:03] (03CR) 10jenkins-bot: Add "memcached-mcrouter" to $wgObjectCaches as default for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436252 (owner: 10Aaron Schulz) [01:12:54] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Add "memcached-mcrouter" to $wgObjectCaches as default for testwiki (duration: 00m 58s) [01:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:03] _joe_: ^ deployed that mcrouter patch fyi [01:29:34] (03PS1) 10Aaron Schulz: [DNM] Set "mcrouterAware" flag for "memcached-mcrouter" object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440039 [01:31:20] * AaronSchulz wonders what wikis to later do next [02:18:28] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [02:21:48] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:31:21] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 12m 21s) [02:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:18] (03PS1) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [02:33:08] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [03:00:02] Did gerrit crash? [03:00:49] (03PS6) 1020after4: Configuration for phabricator to use swift storage. [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) [03:01:00] It still seems to be working, but when I log into it my account is totally empty and I can't do a git pull from it. 
[03:04:34] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 15m 45s) [03:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:36] (03CR) 1020after4: [C: 031] Configuration for phabricator to use swift storage. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4) [03:06:20] (03PS7) 1020after4: Configuration for phabricator to use swift storage. [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) [03:08:36] (03PS2) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [03:09:24] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [03:14:53] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jun 13 03:14:53 UTC 2018 (duration 10m 19s) [03:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:25] Hello everyone [04:06:26] I would like to ask how to join the wikitech technical team to learn from you? [04:09:53] I come from China and work on technology. I hope to join you in my spare time. I will contribute to wikitech in the future. I also hope to learn from your seniors. 
[04:21:57] (03PS3) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:22:32] Andy_: https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker is a bit of an overview we've put together for that question [04:27:17] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [04:28:17] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [04:29:43] (03CR) 1020after4: [C: 031] "Now that I've added swift_key_codfw and swift_key_eqiad, the puppet compiler is no longer able to compile this change: http://puppet-compi" [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4) [04:31:38] (03PS4) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:35:56] How do I join the Operations Team? Thank you [04:43:47] PROBLEM - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 22 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops [04:48:32] Andy_: for the operations team in particular there are various options, but some people in the past have started by helping out with our public cloud infra, in #wikimedia-cloud [04:49:15] this time of day is generally not great though, generally more people are around 800-2400 UTC [04:49:58] (03PS5) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:50:07] 
ebernhardson:Now your Operations Team does not allow new joins? Can only join wikimedia-cloud? [04:50:47] Andy_: cloud is part of operations, but generally few people start on the production facing infrastructure [04:51:33] on the cloud side there is much more involvement of the wider technical community, and helping them run their tools/bots/etc on our public infra [04:51:50] OK, then how do you join the cloud team? [04:52:30] you would have to talk to them, when they are around, to get an idea of some small projects to start with [04:56:17] OK, Thank you [04:56:55] https://phabricator.wikimedia.org/p/AndyTan/ Please contact me, thank you [05:03:55] !log Deploy schema change on dbstore1001:s1 T191316 T192926 T89737 T195193 [05:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:03] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [05:04:03] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:04:03] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:04:04] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:06:53] (03PS6) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [05:07:05] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277822 (10Marostegui) >>! In T196840#4277313, @mmodell wrote: > @marostegui: I canceled some of the queued jobs which should have helped somewhat. The only thing I know to do... 
[05:09:45] !log Disable gtid on db1066 [05:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:54] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277823 (10mmodell) I've got the queue down to 3.1M by canceling jobs. There is still write traffic involved even to delete the jobs so it hasn't really reduced the traffic as... [05:11:05] !log Starting topology changes in order to get ready for s2 failover - T194870 [05:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:10] T194870: Failover s2 primary master - https://phabricator.wikimedia.org/T194870 [05:12:54] (03PS7) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [05:12:58] <_joe_> Andy_: great! [05:13:08] <_joe_> err [05:13:12] <_joe_> AaronSchulz: great! [05:13:50] <_joe_> sorry for the random ping Andy_ [05:19:10] (03PS3) 10Marostegui: mariadb: Promote db1066 to master [puppet] - 10https://gerrit.wikimedia.org/r/439530 (https://phabricator.wikimedia.org/T194870) [05:20:02] (03CR) 10Marostegui: [C: 032] mariadb: Promote db1066 to master [puppet] - 10https://gerrit.wikimedia.org/r/439530 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:20:46] (03PS3) 10Marostegui: db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) [05:23:38] (03PS3) 10Marostegui: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) [05:23:50] (03PS3) 10Marostegui: wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/439533 (https://phabricator.wikimedia.org/T194870) [05:24:21] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:39:56] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277839 (10jcrespo) p:05High>03Normal I don't think this is high from our perspective- they have dedicated db resources and the replica is up to date, and were aware of the... [05:45:23] (03PS4) 10Jcrespo: db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:45:48] (03CR) 10Jcrespo: "on going -> ongoing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:46:28] (03CR) 10Marostegui: [C: 04-2] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:46:38] (03CR) 10Marostegui: db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:55:36] We are taking over deploy1001 for now, we are going to start the s2 failover soon [05:56:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:57:07] (03CR) 10Marostegui: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:58:10] (03Merged) 10jenkins-bot: db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:58:22] (03CR) 10jenkins-bot: db-eqiad.php: Set s2 as read only [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [05:59:20] jynus: ready? [05:59:47] yeah, let the bot announce the window [05:59:55] :) [06:00:04] jynus and marostegui: I, the Bot under the Fountain, allow thee, The Deployer, to do Database Maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T0600). [06:00:16] !log Starting s2 failover from db1054 to db1066 - T194870 [06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:21] T194870: Failover s2 primary master - https://phabricator.wikimedia.org/T194870 [06:00:30] (03Abandoned) 10Chad: Apache redirects: keep query string attached [puppet] - 10https://gerrit.wikimedia.org/r/429447 (owner: 10Chad) [06:01:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s2 on read-only for primary db master maintnance - T194870 (duration: 01m 08s) [06:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:00] jynus: check positions [06:02:04] confirm read only [06:02:27] db1054-bin.004390 314943534 [06:02:34] yep [06:03:04] db1066-bin.000084 435382603 [06:03:06] go on [06:03:10] yep [06:03:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [06:03:17] (03PS4) 10Marostegui: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) [06:04:04] waiting for jenkins now [06:04:15] some replication errors as expected happened [06:04:28] don't wait [06:04:32] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) 
[06:04:33] just force it [06:04:48] I already +1'ed it [06:05:00] deploying [06:05:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove read only from s2 - T194870 (duration: 00m 34s) [06:05:32] we are back, let's check [06:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:36] T194870: Failover s2 primary master - https://phabricator.wikimedia.org/T194870 [06:06:41] I still get the read only message [06:06:55] me too [06:07:09] we need another commit [06:07:12] to remove ro [06:07:34] yeah [06:07:35] doing it [06:07:40] (03PS1) 10Marostegui: db-eqiad.php: Remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440055 [06:07:57] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: Remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440055 (owner: 10Marostegui) [06:07:59] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [06:08:01] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440055 (owner: 10Marostegui) [06:08:13] deploying [06:08:37] let's recheck [06:08:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove read only from s2 - T194870 (duration: 00m 33s) [06:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:57] works for me now [06:08:58] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277853 (10mmodell) The gerrit notedb migration was a one time event, so it shouldn't really be something that happens with every update. 
[06:09:03] looking good [06:10:01] yeah [06:10:03] I can edit just fine [06:11:13] I will keep on with the follow up tasks [06:11:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [06:12:04] that is the spike from the read only [06:12:08] but it looks gone [06:12:11] yes [06:12:33] (03CR) 10jenkins-bot: db-eqiad.php: Remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440055 (owner: 10Marostegui) [06:12:52] I don't see any issues- let's reset slave on new master and move the old master [06:12:58] and update tendril [06:13:01] yeah, doing that already [06:13:05] let me know if you want me to do some of that [06:13:11] no, no worries [06:13:14] I will take care of it [06:13:46] semi sync was done beforehand, but requires checking the ones missing [06:13:56] i will check the replicas [06:14:15] db1124 (s3) is complaining [06:14:23] about? 
[06:14:28] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.02 seconds [06:14:32] that^ [06:14:54] maybe anomie's script hitting s3 [06:14:57] it is not mediawiki, so we may be ok [06:15:06] yeah, it is a sanitarium [06:15:06] it is codfw and labs [06:15:18] then most likely the script [06:15:20] I will check later [06:15:20] (what a bad timing :-) [06:15:50] confirm position and log file for db1054 to start replicating from db1066: db1066-bin.000084 435382603 [06:15:54] jynus: ^ [06:15:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [06:16:00] (just wanting another pair of eyes) [06:16:21] make sure you are running that on db1054 [06:16:26] yep:) [06:16:44] and yes, coords are ok [06:16:47] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [06:19:07] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [06:19:20] (03CR) 10Marostegui: [C: 032] s2.hosts: db1066 is now s2 primary master [software] - 10https://gerrit.wikimedia.org/r/439534 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [06:19:53] I can see dbtree updated [06:19:58] I just did it [06:20:23] (03CR) 10Marostegui: [C: 032] wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/439533 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [06:20:25] (03Merged) 10jenkins-bot: s2.hosts: db1066 is now s2 primary master [software] - 10https://gerrit.wikimedia.org/r/439534 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [06:21:07] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 340.55 seconds [06:22:09] db1054 lag is increasing, is it replicating? 
[06:22:27] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:22:35] apparently yes [06:23:06] it is [06:23:22] it may have issues catching up, for several reasons [06:23:29] including the bbu :( [06:23:29] (hard, version, etc.) [06:23:36] I will force write-back [06:23:44] actually [06:23:47] it is ok [06:24:07] we should just ack the lag and eventually stop it [06:24:17] yep [06:24:23] we need to find another candidate master [06:26:17] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [06:26:50] actually, semisync is disabled everywhere [06:26:56] so not sure why it reported 4 clients [06:27:10] (03PS2) 10Elukey: profile::geowiki: remove unused/old crons [puppet] - 10https://gerrit.wikimedia.org/r/439529 [06:27:12] I have been running all the stop and start [06:27:18] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [06:27:19] and the check reports more than 1 client [06:27:31] oh, I see [06:27:35] it is enabled, yes [06:27:44] 5 clients now [06:27:47] yeah [06:28:11] I guess you didn't enable it on db1054 [06:28:17] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:28:17] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:28:19] (03CR) 10Elukey: [C: 032] "Even if not complete, it will remove daily cronspam to analytics :)" [puppet] - 10https://gerrit.wikimedia.org/r/439529 (owner: 10Elukey) [06:28:22] no, not yet :) [06:30:58] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/root/.screenrc] [06:31:17] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity] [06:31:29] 10Operations, 10ops-esams, 10netops: cp3036 and cp3037 production ports mislabeled - https://phabricator.wikimedia.org/T196970#4277861 (10ayounsi) 05Open>03Resolved a:03ayounsi Thanks, fixed: ```lang=diff [edit interfaces xe-3/0/4] - description cp3037; + description cp3036; [edit interfaces xe-3/0... [06:31:48] I will close T195487 ? [06:31:48] T195487: Announce read-only time for wikis on s2 for 13th June 2018 - https://phabricator.wikimedia.org/T195487 [06:32:07] yeah, maybe also put this: https://phabricator.wikimedia.org/T194870#4277864 [06:34:01] I am going to force write-back on db1054 so at least it is up to date, just in case [06:34:56] ok [06:35:22] done [06:35:24] we can also reduce consistency- doesn't hurt if we do the other change [06:35:25] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665#4277869 (10ayounsi) [06:35:26] and it is catching up quickly now [06:35:27] 10Operations, 10ops-codfw, 10netops: switch port configuration for bast2002 - https://phabricator.wikimedia.org/T196957#4277866 (10ayounsi) 05Open>03Resolved a:03ayounsi Added to the public vlan: ```lang=diff [edit interfaces interface-range vlan-public1-b-codfw] member ge-8/0/12 { ... } + mem... 
[06:35:57] but let's create a ticket to decom + setup another candidate [06:36:07] yeah I was creating the decommissioning one now [06:36:38] I would add there the candidate, as it would be technically part of the failover process [06:36:43] yeah [06:36:47] as a decomm step, I mean [06:37:33] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4245147 (10elukey) Yep I think that uploading the new version (cassandra-2.2.6-wmf5) to the cassandra22 component shou... [06:41:22] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440057 [06:41:26] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440057 [06:43:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440057 (owner: 10Marostegui) [06:45:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440057 (owner: 10Marostegui) [06:46:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 after alter table (duration: 00m 59s) [06:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440057 (owner: 10Marostegui) [06:56:27] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:38] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:09] (03PS1) 10Muehlenhoff: Switch Chad to volunteer, now has signed the NDA [puppet] - 10https://gerrit.wikimedia.org/r/440060 [06:58:47] RECOVERY - puppet 
last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:58:47] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:01:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4277934 (10Dzahn) @mepps Sorry, i don't know very much about JupyterHub. What I do know though is that the docs say "You will need production access (ask for... [07:01:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4277935 (10Dzahn) 05Resolved>03Open [07:01:46] (03CR) 10Muehlenhoff: [C: 032] Switch Chad to volunteer, now has signed the NDA [puppet] - 10https://gerrit.wikimedia.org/r/440060 (owner: 10Muehlenhoff) [07:01:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140052 (10Dzahn) a:05Dzahn>03None [07:02:38] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140052 (10Dzahn) [07:03:16] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4140052 (10Dzahn) [07:04:45] (03PS2) 10Giuseppe Lavagetto: jobrunner: reduce to one redis server per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/439944 (https://phabricator.wikimedia.org/T197003) [07:14:50] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4277952 (10Lea_WMDE) [07:18:13] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: reduce to 
one redis server per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/439944 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [07:24:54] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#4277981 (10hashar) [07:31:19] (03PS1) 10Volans: cumin: simplify role description for MOTD [puppet] - 10https://gerrit.wikimedia.org/r/440062 [07:31:21] ema: ^^^^ :-P [07:32:29] (03PS1) 10Giuseppe Lavagetto: jobrunner: remove other references to servers we're removing [puppet] - 10https://gerrit.wikimedia.org/r/440063 [07:32:31] (03CR) 10Ema: [C: 031] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/440062 (owner: 10Volans) [07:32:38] <_joe_> wait for merging, please [07:33:06] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: remove other references to servers we're removing [puppet] - 10https://gerrit.wikimedia.org/r/440063 (owner: 10Giuseppe Lavagetto) [07:33:17] <_joe_> go on now :) [07:33:19] !log start removing ms-be1036 from swift rings - T196873 [07:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:23] T196873: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 [07:33:34] <_joe_> we should really, really get away from ff-only [07:33:48] <_joe_> this week I had to do quite some work on puppet doing a ton of commits [07:33:58] <_joe_> and it took away all the rhythm from my work [07:34:54] <_joe_> I argued multiple times to change that policy, by now I exhausted the energies to argue anymore [07:35:11] <_joe_> I just say ff-only is a huge hindrance to our productivity, with no real benefit [07:35:25] (03Abandoned) 10Dzahn: remove mariadb includes from mw-maintenance role [puppet] - 10https://gerrit.wikimedia.org/r/437382 (owner: 10Dzahn) [07:38:49] kaldari: somehow gerrit has created two accounts for you [07:39:02] 
Which should be impossible but apparently some how it has happened [07:39:41] (03PS1) 10Marostegui: db1054.yaml: Update socket [puppet] - 10https://gerrit.wikimedia.org/r/440064 [07:40:06] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4278018 (10ayounsi) [07:40:26] (03PS2) 10Marostegui: db1054.yaml: Update socket [puppet] - 10https://gerrit.wikimedia.org/r/440064 [07:41:25] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4273636 (10ayounsi) a:03Papaul Switch ports configured, table in description updated. [07:41:48] !log Stop MySQL on db1054 for socket update and binlog change [07:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:19] (03CR) 10Marostegui: [C: 032] db1054.yaml: Update socket [puppet] - 10https://gerrit.wikimedia.org/r/440064 (owner: 10Marostegui) [07:46:17] (03PS1) 10Marostegui: db-eqiad.php: Restore s2 default read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440066 [07:47:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore s2 default read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440066 (owner: 10Marostegui) [07:49:21] (03Merged) 10jenkins-bot: db-eqiad.php: Restore s2 default read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440066 (owner: 10Marostegui) [07:49:34] (03CR) 10jenkins-bot: db-eqiad.php: Restore s2 default read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440066 (owner: 10Marostegui) [07:50:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440067 (https://phabricator.wikimedia.org/T191316) [07:50:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore s2 default read-only message (duration: 00m 57s) [07:50:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:50] (03CR) 10Gehel: [C: 04-1] "A few comments inline. I'll play a bit with the code before sending more comments (or directly a patch)." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [07:52:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440067 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:54:03] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440067 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:54:10] (03CR) 10Elukey: [C: 031] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [07:54:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440067 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:54:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Reduce the jobqueue redis to use just one server per dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [07:55:17] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.00 seconds [07:55:17] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.09 seconds [07:55:18] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.04 seconds [07:55:27] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.46 seconds [07:55:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 for alter table (duration: 00m 58s) [07:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:36] !log 
Deploy schema change on db1089 T191316 T192926 T89737 T195193 [07:55:38] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.12 seconds [07:55:38] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.65 seconds [07:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:42] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [07:55:42] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [07:55:42] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [07:55:43] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [07:55:48] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.40 seconds [07:55:58] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 349.08 seconds [07:56:25] (03Merged) 10jenkins-bot: Reduce the jobqueue redis to use just one server per dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [07:57:43] !log restart cpjobqueue on scb1001 cause, it's lost all it's workers [07:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:29] (03CR) 10jenkins-bot: Reduce the jobqueue redis to use just one server per dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [07:58:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440068 (https://phabricator.wikimedia.org/T197063) [07:59:34] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: Remove unused redis 
shards from the jobqueue T197003 (duration: 00m 58s) [07:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:39] T197003: Dismantle most of the old jobqueue infrastructure - https://phabricator.wikimedia.org/T197003 [08:00:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440068 (https://phabricator.wikimedia.org/T197063) (owner: 10Marostegui) [08:00:38] (03PS1) 10Ayounsi: Add Icinga alert for a Grafana traffic dashboard [puppet] - 10https://gerrit.wikimedia.org/r/440069 [08:01:09] (03PS2) 10Jcrespo: mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961 [08:01:11] (03CR) 10jerkins-bot: [V: 04-1] Add Icinga alert for a Grafana traffic dashboard [puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:01:51] (03PS1) 10Dzahn: rm mwmaint1001.yaml - activate mariadb::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/440070 (https://phabricator.wikimedia.org/T192092) [08:02:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440068 (https://phabricator.wikimedia.org/T197063) (owner: 10Marostegui) [08:02:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440068 (https://phabricator.wikimedia.org/T197063) (owner: 10Marostegui) [08:02:52] (03PS2) 10Dzahn: rm mwmaint1001.yaml - activate mariadb::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/440070 (https://phabricator.wikimedia.org/T192092) [08:03:58] (03CR) 10Dzahn: [C: 032] ""common" will set this to enabled" [puppet] - 10https://gerrit.wikimedia.org/r/440070 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [08:04:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 for binlog change - T197063 (duration: 00m 57s) [08:04:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] T197063: Decommission db1054 - https://phabricator.wikimedia.org/T197063 [08:04:28] (03PS3) 10Jcrespo: mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961 [08:04:35] !log Stop MySQL and reboot db1076 - T197063 [08:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:34] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440071 [08:09:35] PROBLEM - DPKG on restbase2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:10:35] RECOVERY - DPKG on restbase2001 is OK: All packages OK [08:12:26] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1980 bytes in 0.078 second response time [08:12:35] <_joe_> mutante ^^ [08:12:46] <_joe_> dunno if we switched something already [08:13:45] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [08:13:49] _joe_: no, we didn't switch anything yet that would affect wikidata [08:14:10] i just sent a mail a minute ago that i would like to do it tomorrow [08:14:27] about to talk to Ladsgroup about the wikidata part specifically [08:14:30] that happens all the time, I don't think most of the time is a real failure, but a monitoring failure [08:14:45] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [08:15:19] yea, www.wikidata.org seems to have normal content [08:15:27] re: "pattern not found" [08:15:44] <_joe_> mutante: that's a bad error message [08:16:33] still, a bug on monitoring [08:16:37] <_joe_> jynus: no it's a real alert, it means the lag exceeds 300 seconds [08:16:45] ^ [08:16:48] <_joe_> the error message is a standard icinga one [08:17:00] 
<_joe_> because we use check_http_regexp [08:17:13] if it says "pattern not found" then check_http is looking for a string [08:17:26] <_joe_> it's looking for a pattern [08:17:26] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.093 second response time [08:17:29] I will in the future suggest very small changes to our alter messages [08:17:46] e.g. alert name "wikidata despatch lag" [08:17:51] <_joe_> this is just a standard icinga check, we need to write our own if we want a better message [08:18:03] <_joe_> we might start by not scraping a web page [08:18:11] error "lag is > 300s: Xs" [08:18:17] +1 [08:18:31] I have other examples that are even easier [08:18:46] remove all "check" from check names [08:19:07] and the details about the implementation from the name [08:19:44] I will send some patches when I have the time [08:24:45] so the actual page we check is https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics and i see what you mean now after looking at the --ereg parameter of the actual check command [08:25:05] -ereg '"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|300),' [08:26:51] (03CR) 10Ayounsi: [V: 032] "Adding reviewers and forcing jenkins' +2." 
[puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:27:08] lag cant fall under 100 i guess [08:31:32] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440074 (https://phabricator.wikimedia.org/T191298) [08:32:46] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440074 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [08:33:11] (03PS2) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440074 (https://phabricator.wikimedia.org/T191298) [08:34:16] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440074 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [08:41:00] _joe_: I guess I can merge now right? :) [08:41:06] (03PS2) 10Volans: cumin: simplify role description for MOTD [puppet] - 10https://gerrit.wikimedia.org/r/440062 [08:41:28] <_joe_> volans: yes [08:41:46] (03CR) 10Volans: [C: 032] cumin: simplify role description for MOTD [puppet] - 10https://gerrit.wikimedia.org/r/440062 (owner: 10Volans) [08:41:49] thx [08:44:02] 10Operations, 10ops-eqiad, 10Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#4278185 (10fgiunchedi) p:05Normal>03High There have been edac correctable memory errors reported for this host, raising priority to high since the cpu temp alerts also persist ``` Jun 13 04:... 
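Note on the dispatch lag check discussed above: a minimal sketch, assuming Python's `re` module agrees with the POSIX extended regex used by the Icinga check for this pattern, of which lag values the quoted `-ereg` pattern accepts. The response body here is a hypothetical, heavily trimmed stand-in for the real API output, and `sample_body` and `pattern_found` are illustrative names only.

```python
import re

# Pattern copied from the -ereg argument quoted in the log above.
LAG_OK = re.compile(r'"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|300),')


def sample_body(lag):
    # Hypothetical, trimmed stand-in for the siteinfo statistics response.
    return '{"dispatch":{"median":{"id":1,"lag":%d,"touched":"x"}}}' % lag


def pattern_found(lag):
    # The check is OK when the pattern is found, CRITICAL otherwise.
    return LAG_OK.search(sample_body(lag)) is not None


for lag in (5, 99, 300, 301, 330):
    status = "OK" if pattern_found(lag) else "CRITICAL (pattern not found)"
    print(lag, status)
```

As far as the pattern itself goes, both leading digits are optional, so values below 100 still match. Only a lag above 300 fails to match, which is what surfaces as "pattern not found".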
[08:46:16] !log depool cp1053 T165252 [08:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:21] T165252: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252 [08:46:28] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440076 (https://phabricator.wikimedia.org/T191298) [08:47:03] (03CR) 10Giuseppe Lavagetto: systemd: add define specific to timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [08:47:33] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440076 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [08:47:56] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 166 ge 4 Ema Host depooled T165252 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops [08:48:56] (03PS5) 10Giuseppe Lavagetto: systemd: add define specific to timers [puppet] - 10https://gerrit.wikimedia.org/r/417948 [08:49:57] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440077 (https://phabricator.wikimedia.org/T191298) [08:51:00] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440077 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [08:52:28] (03CR) 10Giuseppe Lavagetto: [C: 032] systemd: add define specific to timers [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [08:53:18] _joe_: I had comments [08:53:22] reviewing now [08:53:38] (03PS8) 10Gehel: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [08:53:44] 
<_joe_> volans: ah sorry, I got the green light from vgutierrez [08:53:54] <_joe_> but since this is still unused, we have all the time to change things [08:54:02] (03CR) 10Volans: "some comment/question inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [08:54:32] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [08:54:54] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4278217 (10jcrespo) [08:55:09] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4278232 (10jcrespo) [08:55:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3857747 (10jcrespo) [08:57:18] (03CR) 10Giuseppe Lavagetto: [C: 032] systemd: add define specific to timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [08:57:57] <_joe_> volans: why would we need to add 'provider' => 'systemd'? [08:58:23] it was needed in systemd::service { 'update-etcd-mw-config-lastindex.timer': [08:58:37] <_joe_> do you remember why? [08:58:57] not exactly, I can try to reconstruct it [08:59:11] <_joe_> it's a puppet bug, in that case [08:59:18] <_joe_> Oh I just remembered [08:59:29] <_joe_> puppet still defaults to using "service" on debian [08:59:31] <_joe_> IIRC [09:00:01] <_joe_> which ofc doesn't work with non-service units [09:00:05] maybe we refactored systemd::service in the meanwhile, don't remeber [09:00:15] <_joe_> actually, I think we should change systemd::service [09:00:27] ahhh maybe Systemd::Unit_type doesn't have timers? 
[09:00:50] !log Restart mysql on dbstore1002 for maintenance [09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:54] because if you can pass $unit_type = 'timer' seems that it will use systemd [09:00:56] <_joe_> volans: nope, it's there [09:00:56] elukey: ^ [09:00:57] so maybe was for that [09:01:01] then dunno [09:01:10] <_joe_> paravoid maybe remembers more about the service provider in puppet and debian [09:01:11] marostegui: <3 [09:01:20] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4278273 (10jcrespo) [09:01:23] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4278272 (10jcrespo) [09:01:36] <_joe_> volans: yes we fixed systemd::service in the meanwhile [09:01:46] <_joe_> # Force the provider of the service to be systemd if the unit type is [09:01:49] ah ok [09:01:50] <_joe_> # not service. Otherwise, they'd fail on at least debian jessie [09:01:54] <_joe_> :) [09:04:31] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4278276 (10jcrespo) [09:05:07] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3857747 (10jcrespo) [09:06:20] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4278296 (10jcrespo) p:05Low>03Normal Not low anymore, based on my proposal of 1 server movement. 
[09:06:52] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4278310 (10jcrespo) p:05Triage>03Normal [09:07:06] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [09:09:16] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4218.97 seconds [09:09:25] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2254.77 seconds [09:14:58] (03CR) 10Volans: "thanks for the replies" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [09:15:24] _joe_: so now you can refactor icinga::monitor::etcd_mw_config to use the new timer as a test ;) [09:15:32] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [09:16:54] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:23:53] (03PS2) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [09:25:03] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:26:00] (03CR) 10Volans: "nit inline" (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:26:31] (03CR) 10Nehajha: ">" (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) 
(owner: 10Nehajha) [09:34:20] (03PS9) 10Gehel: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [09:35:02] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [09:35:44] (03PS3) 10Nehajha: Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) [09:36:03] (03PS3) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [09:37:25] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:41:17] (03PS10) 10Gehel: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [09:42:29] (03PS4) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [09:43:45] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:47:10] (03PS11) 10Gehel: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [09:47:20] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278435 (10fgiunchedi) I researched the "panic on uncorrectable errors" a bit and turns out not edac but the 
machine check framework already takes care of panicking... [09:48:26] (03PS7) 10Elukey: Move the varnishkafka submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) [09:48:28] (03PS3) 10Elukey: Move the kafkatee submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437950 (https://phabricator.wikimedia.org/T188377) [09:48:30] (03PS3) 10Elukey: Move the jmxtrans submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437951 (https://phabricator.wikimedia.org/T188377) [09:48:32] (03PS1) 10Elukey: Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/440080 (https://phabricator.wikimedia.org/T188377) [09:52:33] (03CR) 10jerkins-bot: [V: 04-1] Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/440080 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [09:53:12] (03PS5) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [09:54:17] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:56:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [09:56:55] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [09:58:38] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [10:00:06] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is 
OK: OK slave_sql_lag Replication lag: 1.63 seconds [10:01:06] (03PS1) 10Ema: vcl: remove 3DES deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/440087 (https://phabricator.wikimedia.org/T147199) [10:03:02] (03PS6) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [10:04:03] (03PS14) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [10:04:11] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [10:09:33] (03PS1) 10Volans: Improve logging [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440088 (https://phabricator.wikimedia.org/T191299) [10:09:37] 10Operations, 10monitoring: Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084#4278480 (10fgiunchedi) [10:09:43] 10Operations, 10Wikimedia-Planet: Only include the last e.g. 
6 months of news - https://phabricator.wikimedia.org/T196965#4278493 (10Dzahn) [10:10:40] (03CR) 10jerkins-bot: [V: 04-1] Improve logging [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440088 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:10:56] !log upload apertium-apy_0.11.3-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [10:10:58] kart_: ^ [10:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:25] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [10:12:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [10:12:55] (03PS7) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) [10:14:05] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [10:17:24] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4096194 (10Dzahn) gerrit.wmfusercontent.org now exists in cache::misc and requests would be forwarded to cobalt as the backend. This unblocked this to a certain extent because avatar... [10:17:31] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4278528 (10Dzahn) p:05Triage>03Normal [10:18:58] (03CR) 10Elukey: "BBlack: I tried to send a first attempt of code change on top of the other ones that I made. 
The assumptions were:" [puppet] - 10https://gerrit.wikimedia.org/r/440080 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [10:19:01] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278534 (10akosiaris) >>! In T183177#4278435, @fgiunchedi wrote: > I researched the "panic on uncorrectable errors" a bit and turns out not edac but the machine che... [10:22:52] 10Operations, 10monitoring: Report problems found by mcelog - https://phabricator.wikimedia.org/T197086#4278539 (10fgiunchedi) [10:24:16] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278560 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I'm resolving this task since we're alerting on uncorrectable memory errors found by EDAC now. Uncorre... [10:24:33] akosiaris: ugh sorry I resolved before reading your comment :( [10:25:30] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278566 (10fgiunchedi) 05Resolved>03Open [10:26:33] 10Operations: Upgrading python-requests on trusty - https://phabricator.wikimedia.org/T197088#4278567 (10MoritzMuehlenhoff) [10:28:19] godog: that's fine I did not have much of an input. I am mostly confused as to what should be happening if an UE shows up [10:28:33] will the box panic ? or will the process get a SIGBUS ? 
[10:33:49] (03PS9) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [10:34:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] Update apertium-apy initscripts (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [10:35:25] (03CR) 10Alexandros Kosiaris: [C: 031] Improve logging [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440088 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:36:11] (03CR) 10Jcrespo: "Example usage: https://phabricator.wikimedia.org/P7254" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [10:40:06] akosiaris: heh afaict "it depends", I'll answer on the task [10:40:45] lol [10:47:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440090 [10:47:48] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440090 [10:50:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440090 (owner: 10Marostegui) [10:51:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440090 (owner: 10Marostegui) [10:52:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440090 (owner: 10Marostegui) [10:53:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440091 (https://phabricator.wikimedia.org/T191316) [10:53:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1089 after alter table (duration: 00m 59s) [10:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:27] (03PS2) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/440091 (https://phabricator.wikimedia.org/T191316) [10:56:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440091 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:57:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440091 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:59:09] (03PS2) 10Marostegui: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440071 [10:59:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 for alter table (duration: 00m 58s) [10:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:19] !log Deploy schema change on db1105:3311 T191316 T192926 T89737 T195193 [10:59:24] (03CR) 10Urbanecm: "@MarcoAurelio: Ad the invalid account "priv", it was added before, at the bottom of the section. 
I find this location confusing, so I move" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [10:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:25] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [10:59:25] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [10:59:26] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [10:59:26] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [11:01:26] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440091 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [11:01:35] (03CR) 10Urbanecm: "IMO out of scope here. Moving to https://phabricator.wikimedia.org/T197095." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [11:03:30] (03CR) 10Volans: [V: 032 C: 032] Improve logging [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440088 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [11:05:34] (03PS1) 10Urbanecm: Clear duplicate right specification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) [11:06:30] (03PS1) 10Volans: Client CLI: change versioning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440093 (https://phabricator.wikimedia.org/T191300) [11:07:16] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [11:07:16] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [11:07:31] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: change versioning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440093 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:10:45] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [11:13:19] akosiaris: Thanks. Will address comments soon. 
[11:17:52] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4278789 (10awight) Adding onto @Halfak's comments, I agree that social convention seems to be the best way to protect against runaway JADE usage. Specifi... [11:21:06] !log installing perl security updates [11:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:44] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4278809 (10elukey) Seems to work fine in labs, I am ok with uploading it to reprepro (cassandra22 component) for jessi... [11:24:48] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4278810 (10awight) @Ladsgroup it would be great if you could weigh in with your concerns, if you still think we shouldn't deploy? 
[11:27:16] (03PS3) 10KartikMistry: Update apertium-apy initscripts [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) [11:30:57] (03PS3) 10Alexandros Kosiaris: scaffolding: Disabling monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434476 [11:31:11] (03PS1) 10Volans: Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) [11:32:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] scaffolding: Disabling monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434476 (owner: 10Alexandros Kosiaris) [11:32:24] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:33:05] (03PS2) 10Alexandros Kosiaris: Allow autoallocate service port, use it under minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/434924 [11:33:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Allow autoallocate service port, use it under minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/434924 (owner: 10Alexandros Kosiaris) [11:33:58] (03CR) 10Muehlenhoff: [C: 031] Client CLI: change versioning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440093 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:35:27] (03CR) 10Alexandros Kosiaris: [C: 031] Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:36:25] (03CR) 10Alexandros Kosiaris: [C: 031] Client CLI: change versioning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440093 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:39:22] (03CR) 10Volans: [V: 032 C: 032] Client CLI: change versioning 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/440093 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:39:54] (03PS2) 10Volans: Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) [11:41:00] (03CR) 10Muehlenhoff: [C: 031] Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:41:10] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:41:38] (03CR) 10Volans: [V: 032 C: 032] Client CLI: use backward compatible requests syntax [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440095 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:47:56] (03PS1) 10Muehlenhoff: Add a .gitreview file [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440097 [11:48:41] (03PS1) 10Ema: varnish: install libvmod-re2 [puppet] - 10https://gerrit.wikimedia.org/r/440098 (https://phabricator.wikimedia.org/T164609) [11:48:55] (03PS1) 10Dzahn: mw-maintenance: rsync home dirs from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440099 (https://phabricator.wikimedia.org/T192092) [11:49:01] (03CR) 10jerkins-bot: [V: 04-1] Add a .gitreview file [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440097 (owner: 10Muehlenhoff) [11:49:16] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [11:49:58] (03PS1) 10Dzahn: mw-deployment: remove rsync for tin home dirs [puppet] - 10https://gerrit.wikimedia.org/r/440100 [11:51:00] (03CR) 10Ema: [C: 032] varnish: install libvmod-re2 [puppet] - 10https://gerrit.wikimedia.org/r/440098 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) 
[11:51:08] (03CR) 10Volans: [C: 031] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440097 (owner: 10Muehlenhoff) [11:52:36] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:52:58] (03PS2) 10Dzahn: mw-deployment: remove rsync for tin home dirs [puppet] - 10https://gerrit.wikimedia.org/r/440100 (https://phabricator.wikimedia.org/T175288) [11:53:33] (03CR) 10Dzahn: [C: 032] "has been done. tin will be decom'ed. not needed anymore" [puppet] - 10https://gerrit.wikimedia.org/r/440100 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [11:53:50] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add a .gitreview file [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440097 (owner: 10Muehlenhoff) [11:54:18] (03PS2) 10Dzahn: mw-maintenance: rsync home dirs from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440099 (https://phabricator.wikimedia.org/T192092) [11:58:33] (03Abandoned) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440076 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [11:58:43] (03Abandoned) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440077 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [12:00:04] (03CR) 10Volans: [C: 031] "LGTM, couple of nit inline" (032 comments) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [12:00:35] (03PS2) 10WMDE-Fisch: Enable FileImporter monolog channel in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370) (owner: 10Addshore) [12:04:44] (03CR) 10Muehlenhoff: Add initial Debianisation of debmonitor-client (032 comments) [software/debmonitor] (debian) - 
10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [12:06:15] PROBLEM - Device not healthy -SMART- on db2052 is CRITICAL: cluster=mysql device=cciss,1 instance=db2052:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw%2520prometheus%252Fops [12:06:21] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440078 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [12:11:35] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [12:12:36] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [12:14:21] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable keystone local database [puppet] - 10https://gerrit.wikimedia.org/r/440102 (https://phabricator.wikimedia.org/T196633) [12:15:32] (03PS1) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 [12:16:29] !log T196633 extend downtime for labcontrol1003 [12:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:35] T196633: cloudvps: eqiad1 deployment - https://phabricator.wikimedia.org/T196633 [12:16:48] (03PS2) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 [12:17:11] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable keystone local database [puppet] - 10https://gerrit.wikimedia.org/r/440102 (https://phabricator.wikimedia.org/T196633) [12:17:55] (03PS15) 10Paladox: Gerrit: 
Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [12:18:52] (03PS16) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [12:19:47] (03PS4) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808 [12:19:56] (03PS5) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808 [12:21:14] (03CR) 10Arturo Borrero Gonzalez: [V: 031] "https://puppet-compiler.wmflabs.org/compiler02/11485/" [puppet] - 10https://gerrit.wikimedia.org/r/440102 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:21:36] (03PS6) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) [12:22:30] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/440104 (owner: 10Paladox) [12:24:45] (03CR) 10Arturo Borrero Gonzalez: [V: 031 C: 032] openstack: eqiad1: enable keystone local database [puppet] - 10https://gerrit.wikimedia.org/r/440102 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:28:19] !log rolling restart of kafka on kafka1012->23 for openjdk-7 upgrades [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:15] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: bootstrap keystone [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196663) [12:39:32] (03CR) 10Vgutierrez: vcl: remove 3DES deprecation warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440087 (https://phabricator.wikimedia.org/T147199) (owner: 10Ema) [12:44:51] (03CR) 10Gehel: [WIP] Allow multiple elasticsearch instances per host (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [12:45:14] jouncebot, next [12:45:14] In 0 hour(s) and 14 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1300) [12:46:07] !log restart mirror maker on kafka1012->1014 to pick up new openjdk-7 upgrades [12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:28] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "Hiera keys required, admin_token:" [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196663) (owner: 10Arturo Borrero Gonzalez) [12:47:42] (03CR) 10Awight: "Looks like a typo in the link to Phabricator task?" 
[puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196663) (owner: 10Arturo Borrero Gonzalez) [12:53:31] 10Operations: Upgrading python-requests on trusty - https://phabricator.wikimedia.org/T197088#4278969 (10MoritzMuehlenhoff) 05Open>03declined This turned out to be a rabbit hole which would also require upgrading urllib3, so in the end it was worked around in the debmonitor client. [12:53:45] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4278971 (10fgiunchedi) 05Open>03Resolved >>! In T183177#4278534, @akosiaris wrote: >>>! In T183177#4278435, @fgiunchedi wrote: >> I researched the "panic on unc... [12:57:28] !log elukey@deploy1001 Started deploy [analytics/aqs/deploy@84fab89]: Update AQS for T190213 [12:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:32] T190213: Analyze surge of traffic in AQS that lead to 504s - https://phabricator.wikimedia.org/T190213 [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1300). [13:00:04] CFisch_WMDE and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:06] !log elukey@deploy1001 Finished deploy [analytics/aqs/deploy@84fab89]: Update AQS for T190213 (duration: 02m 38s) [13:00:10] Present [13:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:18] here [13:00:26] (03PS1) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [13:01:45] I can SWAT today [13:02:17] !log rolling restart of cassandra in codfw to pick up OpenJDK security update [13:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:29] Hello zeljkof! [13:02:54] Hey zeljkof \o/ [13:03:56] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [13:05:48] CFisch_WMDE: you are first, I'll ping you when the patch is at mwdebug [13:06:01] Urbanecm: please stand by, you are second [13:06:08] yeah, I think there's no need for that zeljkof [13:06:16] nothing to see there :-) [13:06:56] CFisch_WMDE: I should deploy it without mwdebug? [13:07:05] yes you can do so [13:07:10] ack [13:07:17] CFisch_WMDE: ok, will do [13:08:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370) (owner: 10Addshore) [13:09:43] (03PS1) 10Gehel: Introduce parameter data types for a few defined types. 
[puppet] - 10https://gerrit.wikimedia.org/r/440117 [13:10:13] (03Merged) 10jenkins-bot: Enable FileImporter monolog channel in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370) (owner: 10Addshore) [13:10:26] (03CR) 10jenkins-bot: Enable FileImporter monolog channel in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370) (owner: 10Addshore) [13:11:12] (03CR) 10jerkins-bot: [V: 04-1] Introduce parameter data types for a few defined types. [puppet] - 10https://gerrit.wikimedia.org/r/440117 (owner: 10Gehel) [13:12:41] (03PS2) 10Gehel: Introduce parameter data types for a few defined types. [puppet] - 10https://gerrit.wikimedia.org/r/440117 [13:12:55] !log zfilipin@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:439941| Enable FileImporter monolog channel in production (T195370)]] (duration: 01m 00s) [13:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:00] T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 [13:13:31] CFisch_WMDE: the patch is deployed, please check if there is anything to check :) and thanks for deploying with #releng ;) [13:14:10] zeljkof: thanks! [13:14:12] Urbanecm: please stand by, you are next; anything special about any patch? can not be tested, needs a long time to test, needs a script to run...? [13:14:30] zeljkof, you can deploy normally :) [13:15:12] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) (owner: 10Urbanecm) [13:15:58] (03CR) 10jerkins-bot: [V: 04-1] Introduce parameter data types for a few defined types. 
[puppet] - 10https://gerrit.wikimedia.org/r/440117 (owner: 10Gehel) [13:16:01] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279030 (10Papaul) a:05Papaul>03ayounsi @ayounsi all fibers for lvs2010 and lvs2009 are already pulled according to the first plan |LVS2009|C2|asw-c2|asw-a2|asw-b2... [13:18:09] (03PS12) 10Gehel: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [13:19:03] Urbanecm: 440009 has merge conflict :/ [13:19:14] zeljkof, I will fix it. Can you continue with others if possible please? [13:19:20] just noticed, I've switched to new gerrit UI, still confused [13:19:28] Urbanecm: ok, will go to the next one [13:19:54] (03PS2) 10Urbanecm: Make ProofreadPage operate on correct namespaces in pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) [13:19:54] ack [13:20:26] zeljkof, conflict fixed [13:21:51] zeljkof, I've removed your CR+2 on 440009, please readd it to restart tests [13:22:06] Urbanecm: ok [13:22:16] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4279056 (10mepps) Thanks @Dzahn! 
[13:22:29] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) (owner: 10Urbanecm) [13:23:25] (03PS3) 10Dzahn: mw-maintenance: rsync home dirs from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440099 (https://phabricator.wikimedia.org/T192092) [13:24:06] (03Merged) 10jenkins-bot: Make ProofreadPage operate on correct namespaces in pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) (owner: 10Urbanecm) [13:24:19] !log elukey@deploy1001 Started deploy [analytics/aqs/deploy@160206f]: (no justification provided) [13:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:47] Urbanecm: 440009 is at mwdebug [13:25:19] zeljkof, working, please deploy [13:25:29] Urbanecm: deploying [13:25:31] ack [13:26:32] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440009|Make ProofreadPage operate on correct namespaces in pmswikisource (T197033)]] (duration: 00m 57s) [13:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:37] T197033: ProofreadPage namespaces are wrong on pms.source - https://phabricator.wikimedia.org/T197033 [13:26:43] (03CR) 10Dzahn: [C: 032] mw-maintenance: rsync home dirs from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440099 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [13:27:18] Urbanecm: 440009 deployed [13:27:26] ack [13:27:37] (03PS2) 10Zfilipin: Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438271 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [13:28:30] !log elukey@deploy1001 Finished deploy [analytics/aqs/deploy@160206f]: (no justification provided) (duration: 04m 11s) [13:28:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] (03CR) 10jenkins-bot: Make ProofreadPage operate on correct namespaces in pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) (owner: 10Urbanecm) [13:29:16] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:45] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438271 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [13:31:15] (03Merged) 10jenkins-bot: Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438271 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [13:31:57] (03CR) 10Jcrespo: [C: 032] mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961 (owner: 10Jcrespo) [13:32:03] (03PS4) 10Jcrespo: mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961 [13:32:36] Urbanecm: 438271 at mwdebug [13:32:40] (03PS3) 10Dzahn: mw-deployment: remove rsync for tin home dirs [puppet] - 10https://gerrit.wikimedia.org/r/440100 (https://phabricator.wikimedia.org/T175288) [13:32:48] ack [13:32:50] (03PS2) 10Zfilipin: Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [13:33:34] (03CR) 10jenkins-bot: Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438271 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [13:33:47] Ehh [13:33:54] I must be asleep when writing that patch [13:34:08] Revert it please, I'll try it for another time and better... 
[13:34:16] zeljkof, ^ [13:34:27] (03PS4) 10Dzahn: mw-deployment: remove rsync for tin home dirs [puppet] - 10https://gerrit.wikimedia.org/r/440100 (https://phabricator.wikimedia.org/T175288) [13:34:33] Urbanecm: ok, reverting 438271 [13:34:35] ack [13:35:09] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4279103 (10WMDE-Fisch) [13:35:25] (03PS1) 10Zfilipin: Revert "Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440121 [13:35:44] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440121 (owner: 10Zfilipin) [13:37:16] (03Merged) 10jenkins-bot: Revert "Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440121 (owner: 10Zfilipin) [13:37:38] Urbanecm: 440121 reverted [13:37:49] ack [13:37:57] thank you [13:38:50] (03PS3) 10Zfilipin: Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [13:39:03] (03CR) 10jenkins-bot: Revert "Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440121 (owner: 10Zfilipin) [13:39:44] (03PS1) 10Urbanecm: Change meta namespace to Wikimedia_Indonesia on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440122 (https://phabricator.wikimedia.org/T196744) [13:40:06] mutante: Will it screw things up if I copy my home directory from terbium to mwmaint1001 and restart my maintenance script before you do your rsync? 
[13:41:10] (03PS1) 10Ema: cache: allow installing separate VCL files [puppet] - 10https://gerrit.wikimedia.org/r/440123 [13:41:42] (03CR) 10jerkins-bot: [V: 04-1] cache: allow installing separate VCL files [puppet] - 10https://gerrit.wikimedia.org/r/440123 (owner: 10Ema) [13:42:06] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [13:42:48] (03PS2) 10Ema: cache: allow installing separate VCL files [puppet] - 10https://gerrit.wikimedia.org/r/440123 (https://phabricator.wikimedia.org/T164609) [13:43:39] (03Merged) 10jenkins-bot: Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [13:43:52] (03CR) 10jenkins-bot: Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [13:44:42] Urbanecm: 437974 at mwdebug [13:44:44] ack [13:44:59] (03PS2) 10Zfilipin: English aliases for extra namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437976 (https://phabricator.wikimedia.org/T196614) (owner: 10Urbanecm) [13:46:19] zeljkof, working, please deploy [13:47:05] Urbanecm: deploying [13:47:10] ack [13:48:08] (03PS1) 10Awight: [DNM] Enable Extension:JADE in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440124 (https://phabricator.wikimedia.org/T183381) [13:48:16] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437974|Fix wrong language in ur.wiktionary namespace (T196614)]] (duration: 00m 58s) [13:48:16] hi there bawolff [13:48:19] Urbanecm: 437974 deployed [13:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:21] T196614: Wrong "index" namespace on ur.wiktionary - https://phabricator.wikimedia.org/T196614 [13:48:24] ack [13:48:58] Hi Hauskatze [13:49:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/437976 (https://phabricator.wikimedia.org/T196614) (owner: 10Urbanecm) [13:49:35] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4279219 (10awight) [13:49:53] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#3851603 (10awight) [13:50:35] (03Merged) 10jenkins-bot: English aliases for extra namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437976 (https://phabricator.wikimedia.org/T196614) (owner: 10Urbanecm) [13:50:48] (03CR) 10jenkins-bot: English aliases for extra namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437976 (https://phabricator.wikimedia.org/T196614) (owner: 10Urbanecm) [13:51:09] (03PS6) 10Zfilipin: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) (owner: 10Sau226) [13:51:48] Urbanecm: 437976 at mwdebug [13:51:53] ack [13:52:31] working, please deploy [13:53:09] Urbanecm: deploying [13:53:12] ack [13:54:06] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4) [13:54:15] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437976|English aliases for extra namespaces on urwiktionary (T196614)]] (duration: 00m 58s) [13:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:20] T196614: Wrong "index" namespace on ur.wiktionary - https://phabricator.wikimedia.org/T196614 [13:54:33] Urbanecm: 437976 deployed [13:54:36] ack, thank you [13:56:00] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) 
(owner: 10Sau226) [13:56:34] (03PS2) 10Cmjohnson: Adding mgmt dns for labstore1008/9 [dns] - 10https://gerrit.wikimedia.org/r/439287 (https://phabricator.wikimedia.org/T193655) [13:56:53] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279256 (10ayounsi) a:05ayounsi>03Papaul I was not aware of T196560. Changes rolled back for all interfaces other than NIC1. [13:56:59] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for labstore1008/9 [dns] - 10https://gerrit.wikimedia.org/r/439287 (https://phabricator.wikimedia.org/T193655) (owner: 10Cmjohnson) [13:57:18] (03PS1) 10Volans: Hosts update: fix use of already queried objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299) [13:57:24] (03Merged) 10jenkins-bot: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) (owner: 10Sau226) [13:57:35] (03PS1) 10Addshore: wgMultiContentRevisionSchemaMigrationStage MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) [13:57:37] (03PS2) 10Volans: Hosts update: fix use of already queried objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299) [13:58:27] Urbanecm: 437777 at mwdebug [13:58:31] ack [13:58:35] (03CR) 10jerkins-bot: [V: 04-1] Hosts update: fix use of already queried objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:59:23] zeljkof, working, please deploy [13:59:31] Urbanecm: deploying [14:00:38] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437777|Implementing Patroller User Rights for azwiki (T196488)]] (duration: 00m 57s) [14:00:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:43] T196488: Create user rights on azwiki - https://phabricator.wikimedia.org/T196488 [14:00:53] Urbanecm: 437777 [14:00:58] Urbanecm: 437777 deployed [14:01:08] please check and thanks for deploying with #releng ;) [14:01:10] Thank you. Finished just in time [14:01:16] Thank you for your cooperation! [14:01:33] punctual as death ;) [14:01:37] (03CR) 10jenkins-bot: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) (owner: 10Sau226) [14:01:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440071 (owner: 10Marostegui) [14:01:41] (always comes on time) [14:01:50] !log EU SWAT finished [14:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:22] (03PS1) 10Elukey: profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) [14:02:33] (03CR) 10Daniel Kinzler: [C: 031] "This should be done before I6d6c642a0e349646726 lands in master." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: Addshore)
[14:03:20] (PS3) Volans: Hosts update: fix use of already queried objects [software/debmonitor] - https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299)
[14:03:22] (PS1) Volans: Client CLI: fix text for versioning [software/debmonitor] - https://gerrit.wikimedia.org/r/440131 (https://phabricator.wikimedia.org/T191300)
[14:03:24] Now, let's find time for other ~12 patches waiting for deploying :D
[14:03:58] (CR) Filippo Giunchedi: [C: 1] "FWIW we're trying to migrate away from dashboards with "prometheus" in the name, so consider renaming the dashboard first and then add the" [puppet] - https://gerrit.wikimedia.org/r/440069 (owner: Ayounsi)
[14:04:31] (CR) jerkins-bot: [V: -1] Hosts update: fix use of already queried objects [software/debmonitor] - https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[14:04:33] (CR) jerkins-bot: [V: -1] Client CLI: fix text for versioning [software/debmonitor] - https://gerrit.wikimedia.org/r/440131 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[14:04:45] Operations, ops-eqiad, Cloud-VPS, cloud-services-team, Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4279304 (Cmjohnson) @bstorm the dns patch was not merged.
Fixed now...feel free to take over
[14:05:06] (CR) jerkins-bot: [V: -1] db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440071 (owner: Marostegui)
[14:05:48] (PS3) Marostegui: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440071
[14:07:49] (CR) jenkins-bot: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440071 (owner: Marostegui)
[14:08:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 with low weight (duration: 00m 58s)
[14:08:53] (CR) Ottomata: [C: 1] profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) (owner: Elukey)
[14:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:04] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/440132
[14:13:31] Operations, ops-codfw, Traffic, netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4279334 (Papaul) @ayounsi thanks
[14:15:56] !log anomie@deploy1001 Synchronized php-1.32.0-wmf.8/includes/Category.php: Backporting fix for T195397 ([[gerrit:440053]]) (duration: 01m 00s)
[14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:00] T195397: {{PAGESINCATEGORY}} returns incorrect value on en-wiki Category:Candidates for speedy deletion - https://phabricator.wikimedia.org/T195397
[14:17:11] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/440132 (owner: Marostegui)
[14:18:33] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/440132 (owner: Marostegui)
[14:18:45] RECOVERY - Check
systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[14:18:57] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/440132 (owner: Marostegui)
[14:19:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1076 (duration: 00m 57s)
[14:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:05] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:22:45] Operations, Cassandra, Discovery, Maps, Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4279369 (Gehel) Open>Resolved a:Gehel 2.2.6-wmf5 uploaded to reprepro, we can close this task.
[14:24:43] anomie: i was about to copy all the home dirs for the users
[14:24:58] anomie: it won't mess up your files
[14:25:14] if your script can run i am not that sure
[14:26:20] (PS1) Marostegui: db-eqiad.php: Restore db1076 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440134
[14:27:03] mutante: mwscript eval.php seems to work on mwmaint1001, so I think the maintenance script should work too.
[14:27:52] anomie: ok, cool, go ahead then
[14:28:36] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[14:31:56] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[14:50:15] RECOVERY - IPMI Sensor Status on maps1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[15:00:53] mutante: I see the 'time' package isn't installed on mwmaint1001. I can work around that for now. Should I file a task?
[15:02:38] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1076 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440134 (owner: Marostegui)
[15:04:11] (Merged) jenkins-bot: db-eqiad.php: Restore db1076 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440134 (owner: Marostegui)
[15:04:32] Operations, ops-codfw: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664#4279518 (Papaul)
[15:05:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore original weight for db1076 (duration: 00m 58s)
[15:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:41] anomie: but you still have the time command, just not the time package. do you need the GNU version?
[15:07:38] mutante: Yeah, the GNU version gives extra info like peak resident memory. For now I downloaded the package and extracted the binary into my home directory.
[15:08:15] anomie: gotcha, i will make puppet install the package , no ticket needed in this case
[15:08:18] doing it right away
[15:08:22] Ok, thanks
[15:10:11] (CR) Rush: "Neat idea, I have seen this done but with a "watcher" per broadcast domain that builds a table of mac:ip by listening for broadcasts and l" [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[15:11:31] (CR) Vgutierrez: "> Neat idea, I have seen this done but with a "watcher" per broadcast" [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[15:12:25] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 352.16 seconds
[15:13:16] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.10 seconds
[15:16:28] (CR) Andrew Bogott: [C: -1] "> I tried using os.path.expanduser('~/.webservicerc'), in that case, it wasn't
reading arguments from the file." [software/tools-webservice] - https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: Nehajha)
[15:17:56] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 8.07 seconds
[15:18:15] (PS1) Dzahn: mw-maintenance: require GNU time from time package [puppet] - https://gerrit.wikimedia.org/r/440139 (https://phabricator.wikimedia.org/T192092)
[15:18:55] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[15:21:54] (PS1) Jcrespo: mariadb: Remove refereces to db1053 and db1059 and set them as spare [puppet] - https://gerrit.wikimedia.org/r/440140 (https://phabricator.wikimedia.org/T194634)
[15:23:43] (PS2) Dzahn: mw-maintenance: require GNU time from time package [puppet] - https://gerrit.wikimedia.org/r/440139 (https://phabricator.wikimedia.org/T192092)
[15:24:30] (CR) jenkins-bot: db-eqiad.php: Restore db1076 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/440134 (owner: Marostegui)
[15:25:23] (PS3) Dzahn: mw-maintenance: require GNU time from time package [puppet] - https://gerrit.wikimedia.org/r/440139 (https://phabricator.wikimedia.org/T192092)
[15:25:27] !log stopping db1053 and db1059 in preparation for decomm T194634 T196606
[15:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:33] T196606: Decommission db1059 - https://phabricator.wikimedia.org/T196606
[15:25:33] T194634: Decommission db1053 - https://phabricator.wikimedia.org/T194634
[15:25:38] (CR) Dzahn: [C: 2] mw-maintenance: require GNU time from time package [puppet] - https://gerrit.wikimedia.org/r/440139 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[15:26:31] Operations, DBA, MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4279638 (Joe)
[15:27:34] (PS2) Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555)
[15:28:37] Notice: /Stage[main]/Packages::Time/Package[time]/ensure: created
[15:28:46] anomie: GNU time is now installed
[15:29:04] Thanks!
[15:30:56] yw
[15:34:09] (PS1) Ladsgroup: mediawiki: Stop Wikidata dispatching [puppet] - https://gerrit.wikimedia.org/r/440142 (https://phabricator.wikimedia.org/T192092)
[15:36:37] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: disable keystone hourly cron [puppet] - https://gerrit.wikimedia.org/r/440144 (https://phabricator.wikimedia.org/T196633)
[15:36:57] (CR) Muehlenhoff: [C: 1] Client CLI: fix text for versioning [software/debmonitor] - https://gerrit.wikimedia.org/r/440131 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[15:37:29] (CR) Arturo Borrero Gonzalez: [C: 2] openstack: eqiad1: disable keystone hourly cron [puppet] - https://gerrit.wikimedia.org/r/440144 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[15:39:06] (CR) Ladsgroup: "Hey, Please merge and deploy this before switching to mwmaint1001 (it will probably cause some alarms to scream) and then revert it once t" [puppet] - https://gerrit.wikimedia.org/r/440142 (https://phabricator.wikimedia.org/T192092) (owner: Ladsgroup)
[15:39:18] Operations, Mail, Wikimedia-Mailing-lists, Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4279724 (herron) @fgiunchedi and I looked into this some more this morning. Requests have slowed (for now anyway) so let's revert the...
[15:44:11] (PS1) Herron: Revert "mailman: add per IP rate limit of 50 requests per 5 min" [puppet] - https://gerrit.wikimedia.org/r/440146
[15:44:35] (PS1) Andrew Bogott: nova: move glance_host into hiera so it can be configured per-deploy [puppet] - https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791)
[15:45:32] (PS2) Herron: Revert "mailman: add per IP rate limit of 50 requests per 5 min" [puppet] - https://gerrit.wikimedia.org/r/440146 (https://phabricator.wikimedia.org/T196989)
[15:46:49] (PS1) Elukey: role::common::aqs: update druid mediawiki's datasource [puppet] - https://gerrit.wikimedia.org/r/440148
[15:47:03] (PS3) Herron: Revert "mailman: add per IP rate limit of 50 requests per 5 min" [puppet] - https://gerrit.wikimedia.org/r/440146 (https://phabricator.wikimedia.org/T196989)
[15:47:37] (CR) Herron: [C: 2] Revert "mailman: add per IP rate limit of 50 requests per 5 min" [puppet] - https://gerrit.wikimedia.org/r/440146 (https://phabricator.wikimedia.org/T196989) (owner: Herron)
[15:47:45] (CR) Joal: [C: 1] "Thanks a lot Luca!"
[puppet] - https://gerrit.wikimedia.org/r/440148 (owner: Elukey)
[15:48:30] (CR) Elukey: [C: 2] role::common::aqs: update druid mediawiki's datasource [puppet] - https://gerrit.wikimedia.org/r/440148 (owner: Elukey)
[15:48:39] (PS2) Elukey: role::common::aqs: update druid mediawiki's datasource [puppet] - https://gerrit.wikimedia.org/r/440148
[15:48:51] (CR) Elukey: [V: 2 C: 2] role::common::aqs: update druid mediawiki's datasource [puppet] - https://gerrit.wikimedia.org/r/440148 (owner: Elukey)
[15:49:19] * elukey forces himself to retry polygerrit
[15:55:13] !log rolling restart of aqs on aqs100[4-9] to pick up the new config changes
[15:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:48] (PS5) ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063)
[16:00:20] (PS1) Volans: Fix /client endpoint access [software/debmonitor] - https://gerrit.wikimedia.org/r/440155 (https://phabricator.wikimedia.org/T191299)
[16:00:24] (CR) Gehel: [C: 2] logstash: typo gelf long_message -> full_message [puppet] - https://gerrit.wikimedia.org/r/437864 (owner: EBernhardson)
[16:00:31] (PS3) Gehel: logstash: typo gelf long_message -> full_message [puppet] - https://gerrit.wikimedia.org/r/437864 (owner: EBernhardson)
[16:01:19] (CR) jerkins-bot: [V: -1] Fix /client endpoint access [software/debmonitor] - https://gerrit.wikimedia.org/r/440155 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[16:01:22] (PS3) Ema: cache: allow installing separate VCL files [puppet] - https://gerrit.wikimedia.org/r/440123 (https://phabricator.wikimedia.org/T164609)
[16:01:24] (PS1) Ema: cache::text: ship cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609)
[16:02:01] (PS2) Jcrespo: mariadb: Remove refereces
to db1053 and db1059 and set them as spare [puppet] - https://gerrit.wikimedia.org/r/440140 (https://phabricator.wikimedia.org/T194634)
[16:02:45] (PS3) Jcrespo: mariadb: Remove references to db1053 and db1059 and set them as spare [puppet] - https://gerrit.wikimedia.org/r/440140 (https://phabricator.wikimedia.org/T194634)
[16:03:11] (CR) Volans: [V: 2 C: 2] Client CLI: fix text for versioning [software/debmonitor] - https://gerrit.wikimedia.org/r/440131 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[16:03:18] (PS4) Jcrespo: mariadb: Remove references to db1053 and db1059 and set them as spare [puppet] - https://gerrit.wikimedia.org/r/440140 (https://phabricator.wikimedia.org/T194634)
[16:03:29] (CR) Rush: nova: move glance_host into hiera so it can be configured per-deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791) (owner: Andrew Bogott)
[16:04:01] (CR) Jcrespo: [C: 2] mariadb: Remove references to db1053 and db1059 and set them as spare [puppet] - https://gerrit.wikimedia.org/r/440140 (https://phabricator.wikimedia.org/T194634) (owner: Jcrespo)
[16:04:07] (Abandoned) Ema: vcl: remove 3DES deprecation warning [puppet] - https://gerrit.wikimedia.org/r/440087 (https://phabricator.wikimedia.org/T147199) (owner: Ema)
[16:05:30] (CR) Andrew Bogott: nova: move glance_host into hiera so it can be configured per-deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791) (owner: Andrew Bogott)
[16:07:08] (Abandoned) Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - https://gerrit.wikimedia.org/r/440074 (https://phabricator.wikimedia.org/T191298) (owner: Muehlenhoff)
[16:07:48] !log installing imagemagick security updates on trusty (Debian already fixed)
[16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:55] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.02 seconds
[16:10:50] did anyone disable puppet on einsteinium?
[16:11:34] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.27 seconds
[16:11:58] !log installing plexus-archiver security updates
[16:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:30] jynus: don’t think so, there should be "Disabling Puppet" logged in /var/log/puppet.log when it happens
[16:13:01] I think someone or something created /var/lib/puppet/state/agent_catalog_run.lock by mistake
[16:13:04] !log ALTERing Cassandra schema - T197082
[16:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:09] T197082: Cassandra schema migrations to add page_language - https://phabricator.wikimedia.org/T197082
[16:14:05] logs say Applied catalog in 38.98 seconds
[16:14:11] but lock is still on
[16:14:21] !log rsyncing /home dirs from terbium to mwmaint1001, they will appear later in a subdir "home-terbium" like it was done for tin->deploy1001 (T192092)
[16:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:26] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092
[16:14:39] it finally worked
[16:15:00] hmm yeah I see that as well. maybe someone trying a manual run while the scheduled run is in progress?
[16:15:06] a few times?
[16:15:14] well, those trials were me
[16:16:12] sudo apt-get --hold ?
[16:16:39] maybe it looked locked but it was running, sometimes icinga config is quite slow
[16:17:23] mutante: so no terbium anymore? Shall we update the on-wiki docs?
[16:19:05] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
[16:19:40] Hauskatze: as of today it still exists, actual switch is tomorrow though.
updating docs now would not hurt. one user has already migrated his manual maint job
[16:19:47] Operations, ops-codfw, Traffic, netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946#4280024 (Papaul)
[16:20:41] i want to add a patch to get a big warning motd that this isn't the active server, like on non-active deployment server
[16:23:47] testwiki seems to be throwing wikidata errors
[16:23:55] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 59.53 seconds
[16:23:59] [WyFE9ApAIEIAAELTvtYAAACE] /wiki/Main_Page Wikibase\DataModel\Services\Lookup\EntityLookupException from line 44 of /srv/mediawiki/php-1.32.0-wmf.8/extensions/Wikibase/lib/includes/Store/RevisionBasedEntityLookup.php: The serialization "L3" is not recognized by the configured id builders
[16:24:08] on view of testwiki main page
[16:29:28] Operations, monitoring, Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3420747 (Tgr) Any thoughts on whether this might make something like {T189531} easier or harder in the future?
[16:30:58] bawolff: from wmf.8, huh that's from last week
[16:31:31] I think wmf.8 is this week (?)
[16:31:39] !log moving mr1-eqiad interfaces to new router
[16:31:40] In any case, it was on testwiki right now
[16:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:00] bah, I misread :) yes, wmf.8 is this week
[16:33:12] FYI, we're going to move the links from the old to the new mr1-eqiad, as far as I know we can't downtime all the mgmt interfaces, so it might be noisy, but should be transparent if no issues
[16:34:37] (PS2) Volans: Fix /client endpoint access [software/debmonitor] - https://gerrit.wikimedia.org/r/440155 (https://phabricator.wikimedia.org/T191299)
[16:34:47] Operations, ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4280077 (jcrespo)
[16:35:43] (CR) jerkins-bot: [V: -1] Fix /client endpoint access [software/debmonitor] - https://gerrit.wikimedia.org/r/440155 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[16:36:27] Operations, netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4280093 (Papaul) p:Triage>Normal
[16:36:41] bawolff: looks like https://phabricator.wikimedia.org/T195615
[16:38:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@f521e7e]: Add page_language to title_revision table T197082
[16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:08] T197082: Cassandra schema migrations to add page_language - https://phabricator.wikimedia.org/T197082
[16:43:09] Operations, monitoring, Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#4280140 (akosiaris) >>! In T170150#4280054, @Tgr wrote: > Any thoughts on whether this might make something like {T189531} easier or hard...
[16:44:50] Operations, ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4280160 (Papaul) @jcrespo disk replacement complete
[16:46:15] Operations, ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4280167 (Marostegui) ``` physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding) ``` Thanks
[16:48:11] (PS1) Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081)
[16:48:59] (CR) jerkins-bot: [V: -1] [WIP] SSL for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: Ottomata)
[16:50:02] (PS1) Urbanecm: Allow sysops to grant autopatrolled and patroller on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/440163 (https://phabricator.wikimedia.org/T196488)
[16:52:04] (PS1) Gehel: cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - https://gerrit.wikimedia.org/r/440164
[16:53:21] PROBLEM - Host labvirt1019 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:47] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@f521e7e]: Add page_language to title_revision table T197082 (duration: 15m 44s)
[16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:54] T197082: Cassandra schema migrations to add page_language - https://phabricator.wikimedia.org/T197082
[16:54:58] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4280199 (Cmjohnson) a:Cmjohnson>faidon @faidon I disconnected the 1G cables from the switch and plugged the cables into the 10G...
[16:56:17] jouncebot, next
[16:56:17] In 0 hour(s) and 3 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1700)
[16:57:39] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280214 (awight)
[16:57:45] Operations, netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4280217 (Papaul) @ayounsi the name proposal is just temporary so i can add the switches in racktables and do the setup in the scs-a1/c1. After you are done with the configuration and we remove the ol...
[17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1700).
[17:00:04] No GERRIT patches in the queue for this window AFAICS.
[17:00:24] There are my patches
[17:00:25] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280229 (awight)
[17:00:27] (CR) Elukey: "I like it! Left one nit :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/440164 (owner: Gehel)
[17:00:33] jouncebot, refresh
[17:00:34] I refreshed my knowledge about deployments.
[17:01:56] Reedy, can you SWAT?
:)
[17:02:14] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280234 (Halfak)
[17:02:17] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280237 (awight)
[17:03:21] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received
[17:05:31] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[17:06:18] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280245 (Halfak)
[17:06:31] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4260809 (Halfak)
[17:07:02] RECOVERY - Device not healthy -SMART- on db2052 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw%2520prometheus%252Fops
[17:07:06] (PS1) Volans: Tests: remove spurious lines [software/debmonitor] - https://gerrit.wikimedia.org/r/440167
[17:07:08] (PS1) Volans: Force text responses on API errors [software/debmonitor] - https://gerrit.wikimedia.org/r/440168 (https://phabricator.wikimedia.org/T191299)
[17:08:15] (CR) jerkins-bot: [V: -1] Force text responses on API errors [software/debmonitor] - https://gerrit.wikimedia.org/r/440168 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[17:08:15] !log terbium - closing unused screen sessions for all Amir users (2)
[17:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:30] (CR) jerkins-bot: [V: -1] Tests: remove spurious lines [software/debmonitor] - https://gerrit.wikimedia.org/r/440167 (owner: Volans)
[17:08:41] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.94 seconds
[17:12:32] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 355.88 seconds
[17:14:45] Operations, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4280307 (awight)
[17:16:59] (CR) Muehlenhoff: [C: 1] Tests: remove spurious lines [software/debmonitor] - https://gerrit.wikimedia.org/r/440167 (owner: Volans)
[17:19:02] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[17:19:04] Operations, ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4280315 (Papaul)
[17:21:19] Operations, Cassandra, Maps-Sprint, Scap: cassandra/metrics-collector does not deploy with scap on a new install -
https://phabricator.wikimedia.org/T197159#4280346 (Gehel)
[17:26:46] (PS2) Ladsgroup: Enable wp10 data storage in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/432093 (https://phabricator.wikimedia.org/T192268)
[17:29:09] Anybody here to SWAT? Window is running...
[17:32:07] this window has been a bad time for most people, I'm considering moving it to improve the situation
[17:33:48] greg-g, I usually use EU SWAT, but in my experience, usually, less number of patches can be deployed during EU one and higher in Morning one. If somebody comes, ofc
[17:33:58] can you share what timeframes you are considering?
[17:36:24] 1 hour earlier
[17:36:56] right now is Scrum of Scrums (a cross team 30 minute meeting which pulls in various people who are usually here for SWATs, either as deployers or patch submitters)
[17:37:12] so people don't want to start something that might take too long before having to go to that meeting
[17:37:43] but if this SWAT were at 16:00 UTC instead, that doesn't overlap with too many big meetings like that
[17:38:13] Heh, good meeting
[17:40:59] Urbanecm: I can SWAT if you'd like?
[17:41:59] Is there time to add a patch?
[17:42:56] jouncebot, help
[17:42:57] **** JounceBot Help ****
[17:42:57] JounceBot is a deployment helper bot for the Wikimedia Foundation.
[17:42:57] You can find my source at https://github.com/mattofak/jouncebot
[17:42:57] Available commands:
[17:42:57] HELP Prints the list of all commands known to the server
[17:42:57] NEXT Get the next deployment event(s if they happen at the same time)
[17:42:57] NOW Get the current deployment event(s) or the time until the next
[17:42:58] REFRESH Refresh my knowledge about deployments
[17:43:00] jouncebot, refresh
[17:43:01] I refreshed my knowledge about deployments.
[17:43:03] jouncebot, now
[17:43:04] For the next 0 hour(s) and 16 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1700)
[17:43:11] oh well it doesn't print the new patches
[17:43:11] ok
[17:44:16] Krenair: I can do yours
[17:44:25] thanks :)
[17:44:43] (CR) 20after4: [C: 2] deployment-prep: Update BounceHandlerInternalIPs [mediawiki-config] - https://gerrit.wikimedia.org/r/436430 (https://phabricator.wikimedia.org/T184244) (owner: Alex Monk)
[17:45:03] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 2 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4280469 (pmiazga) I'll check service logs
[17:46:33] (Merged) jenkins-bot: deployment-prep: Update BounceHandlerInternalIPs [mediawiki-config] - https://gerrit.wikimedia.org/r/436430 (https://phabricator.wikimedia.org/T184244) (owner: Alex Monk)
[17:46:47] twentyafterfour, if you want
[17:47:16] Operations, Cassandra, Maps-Sprint, Scap: cassandra/metrics-collector does not deploy with scap on a new install - https://phabricator.wikimedia.org/T197159#4280477 (Gehel) Editing `/srv/deployment/cassandra/metrics-collector-cache/.config` to replace the reference to `tin` with a ref to `deploy1...
[17:47:38] twentyafterfour, added my patches
[17:47:40] Thank you
[17:47:52] thanks twentyafterfour
[17:49:01] !log twentyafterfour@deploy1001 Synchronized wmf-config: Sync https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/436430/ for SWAT refs T184244 (duration: 01m 00s)
[17:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:05] T184244: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244
[17:49:46] (CR) 20after4: [C: 2] Regenerate logo for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/439394 (https://phabricator.wikimedia.org/T196803) (owner: Urbanecm)
[17:50:49] (CR) 20after4: [C: 2] Allow sysops to grant autopatrolled and patroller on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/440163 (https://phabricator.wikimedia.org/T196488) (owner: Urbanecm)
[17:51:16] (Merged) jenkins-bot: Regenerate logo for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/439394 (https://phabricator.wikimedia.org/T196803) (owner: Urbanecm)
[17:52:06] (PS3) Rush: openstack: allow designate in labtest to contact labtestn keystone [puppet] - https://gerrit.wikimedia.org/r/437812 (https://phabricator.wikimedia.org/T167559)
[17:52:10] (Merged) jenkins-bot: Allow sysops to grant autopatrolled and patroller on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/440163 (https://phabricator.wikimedia.org/T196488) (owner: Urbanecm)
[17:53:15] !log twentyafterfour@deploy1001 Synchronized static/images/project-logos/bnwikivoyage-1.5x.png: static/images/project-logos/bnwikivoyage-2x.png static/images/project-logos/bnwikivoyage.png sync bnwikivoyage logos refs T196803 (duration: 00m 58s)
[17:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:20] T196803: Regenerate bnwikivoyage logo - https://phabricator.wikimedia.org/T196803
[17:54:00] (CR) Rush: [C: 2]
openstack: allow designate in labtest to contact labtestn keystone [puppet] - 10https://gerrit.wikimedia.org/r/437812 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [17:54:32] sync-file wildcard fail [17:55:14] !log twentyafterfour@deploy1001 Synchronized static/images/project-logos/: sync bnwikivoyage logos refs T196803 (duration: 00m 58s) [17:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] (03CR) 1020after4: [C: 032] Change meta namespace to Wikimedia_Indonesia on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440122 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [17:56:04] (03PS2) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [17:56:07] (03CR) 10jenkins-bot: deployment-prep: Update BounceHandlerInternalIPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436430 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [17:56:09] (03CR) 10jenkins-bot: Regenerate logo for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439394 (https://phabricator.wikimedia.org/T196803) (owner: 10Urbanecm) [17:56:10] oh, I was awarded a badge by zeljkof [17:56:12] (03CR) 10jenkins-bot: Allow sysops to grant autopatrolled and patroller on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440163 (https://phabricator.wikimedia.org/T196488) (owner: 10Urbanecm) [17:56:18] (03PS2) 1020after4: Whitelist *.jpl.nasa.gov [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [17:56:42] Hauskatze, same badge I have :) [17:56:51] * Hauskatze wants 'The Janitor' one. 
[17:56:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [17:56:57] (03PS3) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [17:57:19] (03Abandoned) 10Rush: openstack: backports setup initial run [puppet] - 10https://gerrit.wikimedia.org/r/420060 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:57:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [17:57:48] (03PS4) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [17:57:51] Hauskatze should ask zeljkof then :D [17:58:01] heh [17:58:08] (03CR) 1020after4: [C: 04-1] "I don't think this will work. 
*.nasa.gov is already whitelisted, I think you want to whitelist *.jpl.nasa.gov but this patch doesn't do t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [17:58:16] (03PS5) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [17:58:37] Hauskatze: I like awarding badges ;) [17:58:52] (03PS3) 10Urbanecm: Whitelist *.jpl.nasa.gov [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) [17:58:57] (03CR) 10Urbanecm: "Thank you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [17:59:02] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [17:59:22] (03CR) 1020after4: [C: 032] "better!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [17:59:40] (03PS2) 1020after4: Change meta namespace to Wikimedia_Indonesia on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440122 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [17:59:51] Hauskatze, BTW, Jantior is disabled, isn't it [18:00:00] (03PS6) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1800) [18:00:34] oh, I didn't noticed that [18:01:09] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:01:17] (03Merged) 10jenkins-bot: Whitelist *.jpl.nasa.gov [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [18:01:27] btw it was just a joke [18:01:47] (03PS3) 1020after4: Change meta namespace to Wikimedia_Indonesia on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440122 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [18:02:43] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4280521 (10Volans) Quick first feedback/questions on the proposal: > dbconfig get NAME gets you all the current configuration of a mysql... 
[18:02:45] (03PS7) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:03:14] (03CR) 10jenkins-bot: Whitelist *.jpl.nasa.gov [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) (owner: 10Urbanecm) [18:03:20] (03CR) 1020after4: [C: 032] Set wgLocaltimezone to Europe/Rome for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438282 (https://phabricator.wikimedia.org/T196763) (owner: 10Urbanecm) [18:03:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:06:05] (03PS2) 1020after4: Set wgLocaltimezone to Europe/Rome for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438282 (https://phabricator.wikimedia.org/T196763) (owner: 10Urbanecm) [18:08:19] (03PS8) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:09:01] (03CR) 10jenkins-bot: Change meta namespace to Wikimedia_Indonesia on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440122 (https://phabricator.wikimedia.org/T196744) (owner: 10Urbanecm) [18:09:38] twentyafterfour, what's the status? I don't see any related items in SAL [18:09:51] Urbanecm: I'm fighting zuul to get them merged [18:10:07] since several of the changes only touch initializesettings, I'm going to sync them all at once [18:10:40] Only https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/438282/ isn't merged... [18:10:43] Urbanecm: can you test on mwdebug once I sync initializesettings to the canary hosts? [18:10:43] twentyafterfour: is there any scap command to sync the interwiki cache and generate the patch-set to upload? 
[18:10:57] twentyafterfour, sure [18:11:00] Hauskatze: I'm not sure [18:11:18] I don't think we have a utility for that but it sounds like a useful thing to write [18:12:16] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.57 seconds [18:12:17] scap update-interwiki-cache [18:12:21] twentyafterfour, Hauskatze: ^^ [18:12:39] Per https://wikitech.wikimedia.org/wiki/Add_a_wiki#Database_creation, step 13 [18:13:31] (03CR) 10jenkins-bot: Set wgLocaltimezone to Europe/Rome for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438282 (https://phabricator.wikimedia.org/T196763) (owner: 10Urbanecm) [18:13:37] Urbanecm: all four changes to initializesettings are live on mwdebug1001, can you verify that everything seems to be correct using mwdebug extension...then I will sync to all of production [18:13:42] Sure, will do [18:13:51] thanks Urbanecm [18:14:02] yw Hauskatze. Why do you need it? :) [18:14:05] If I may ask [18:14:15] there's another update request comming I think [18:14:26] we've not touched the code on meta yet though [18:14:38] I am having a little kit-kat-wiki-break [18:14:40] greg-g: Moving the SWAT window back to 11am PST would be good. [18:15:10] twentyafterfour, are they really live? 
[18:15:15] On mwdebug1001 [18:15:22] Urbanecm: should be [18:16:07] (03PS9) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:16:09] twentyafterfour, all changes appears to not be live [18:16:13] Urbanecm: try now [18:16:17] sorry I missed a step [18:16:24] That's better, going to test [18:16:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:16:58] twentyafterfour, working, please deploy them [18:17:37] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 40.34 seconds [18:17:39] 👍 [18:18:16] (03PS10) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:18:27] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [18:18:39] twentyafterfour, btw, can you run namespaceDupes.php on urwiktionary please? [18:19:52] And also, please purge the logo URLs [18:20:11] Urbanecm: I'm not familiar with namespaceDupes... [18:20:19] Docs for purging are at https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge [18:20:43] twentyafterfour, general syntax is mwscript namespaceDupes.php --wiki=wikidbname --fix (without --fix it'll be a dry run). [18:20:48] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: sync 5b47244 14ca2ba 63bc100m and 2569a77 refs T196488, T196744, T196727, T196763 (duration: 00m 57s) [18:20:49] Can you please run dry run at least for me? 
[18:20:51] (03PS11) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:59] T196727: *.nasa.gov URL upload not available - https://phabricator.wikimedia.org/T196727 [18:20:59] T196744: Change wgMetanamespace on idwikimedia to "Wikimedia Indonesia" - https://phabricator.wikimedia.org/T196744 [18:20:59] T196763: Timezone for pmswikisource should be Europe/Rome - https://phabricator.wikimedia.org/T196763 [18:20:59] T196488: Create user rights on azwiki - https://phabricator.wikimedia.org/T196488 [18:21:06] twentyafterfour, docs are at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes [18:21:28] mwscript namespaceDupes.php --wiki=wikidbname (this is dry-run) [18:21:37] Exactly Hauskatze, thank you [18:22:01] !log ran namespaceDupes for urwiktionary [18:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:48] twentyafterfour, can you post result to T196614 please [18:22:49] T196614: Wrong "index" namespace on ur.wiktionary - https://phabricator.wikimedia.org/T196614 [18:22:49] ? [18:23:26] Does anyone know if there's an audit trail for apt.wikimedia.org uploads? [18:24:18] Urbanecm: done [18:24:21] Thx [18:24:31] Were the logo URLs purged? [18:24:33] Urbanecm: do you know the full urls of those logos? 
[18:24:36] Sure [18:24:40] https://en.wikipedia.org/static/images/project-logos/bnwikivoyage.png [18:24:41] I'm logged in to terbium [18:24:44] https://en.wikipedia.org/static/images/project-logos/bnwikivoyage-1.5x.png [18:24:45] https://en.wikipedia.org/static/images/project-logos/bnwikivoyage-2x.png [18:24:50] (03PS12) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:26:22] (03PS13) 10Ottomata: [WIP] SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [18:26:31] Urbanecm: purged [18:26:35] Thank you [18:26:39] you're welcome [18:28:20] !lot finished swat (28m overtime! :P) [18:28:25] grr [18:28:33] !log finished swat (28m overtime! :P) [18:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [18:31:44] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [18:32:33] (03PS1) 10MaxSem: Graduate CodeMirror out of beta on non-RTL wikis, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) [18:32:35] (03PS1) 10MaxSem: Graduate CodeMirror out of beta on non-RTL wikis, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440179 (https://phabricator.wikimedia.org/T185030) [18:33:10] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4280811 (10Krenair) Does someone already working on this want to replace the instance or shall I start a deployment-... 
[18:36:30] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4280883 (10chasemp) p:05Triage>03Normal [18:37:23] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4280883 (10chasemp) @cmjohnson could you describe a bit what you've tried to get the 10G ports to work? [18:47:37] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4280925 (10chasemp) ping @aborrero who indicated he had seem a similar issue in the past [18:49:14] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [18:50:54] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.068 second response time [18:52:25] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:52:51] 10Operations, 10Mail, 10monitoring, 10Wikimedia-Incident: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171#4280935 (10herron) p:05Triage>03Normal [18:53:00] 10Operations, 10Mail, 10monitoring, 10Wikimedia-Incident: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172#4280946 (10herron) [18:53:04] 10Operations, 10Mail, 10Wikimedia-Logstash: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173#4280956 (10herron) [18:53:27] 10Operations, 10Mail, 10monitoring, 10Wikimedia-Incident: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171#4280968 (10herron) Implementation-wise, parsing MX logs seems like a good bet. For starters we could approximate services through a combination... 
[19:00:04] marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T1900). [19:01:04] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.092 second response time [19:04:54] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.23 seconds [19:06:10] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4280883 (10Imarlier) Are these all the same type of machines? How spread at the MAC addresses? Saw something like this a number of years ago, and it turned out to be a bad run of Broadcom... [19:12:14] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 348.34 seconds [19:13:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4281030 (10Bstorm) For my reference, if nothing else: labstore1008 ``` Embedded NIC MAC Addresses: NIC.Integrated.1-1-1 Ethernet... 
[19:18:45] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:23:13] (03PS1) 10Dduvall: group1 wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440187 [19:23:15] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440187 (owner: 10Dduvall) [19:25:01] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440187 (owner: 10Dduvall) [19:25:44] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.8 [19:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.8 (duration: 00m 57s) [19:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:47] Would anyone with +2 on puppet be able to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439648/ for me? [19:40:56] marlier: sure, looking at it now [19:42:10] (03PS4) 10Herron: Need to install mongodb on xhgui machines [puppet] - 10https://gerrit.wikimedia.org/r/439648 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [19:44:31] (03CR) 10Herron: [C: 032] Need to install mongodb on xhgui machines [puppet] - 10https://gerrit.wikimedia.org/r/439648 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [19:45:13] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4281091 (10Krenair) So I think to replace it properly we need https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436431/ and https://gerrit.wiki... [19:45:27] (03CR) 10Eevans: [C: 031] "+1 to @elukey's nit, LGTM otherwise!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [19:46:51] marlier: merged, and puppet ran happily on webperf[12]002 afterwards [19:49:04] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.195 second response time [19:52:44] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440187 (owner: 10Dduvall) [19:53:50] herron: thank you! [19:54:34] np! [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T2000). [20:00:35] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.06 seconds [20:00:54] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.05 seconds [20:00:55] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.35 seconds [20:01:04] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.02 seconds [20:01:14] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.29 seconds [20:01:24] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.32 seconds [20:01:36] Does anyone know what mechanism is used to remove decommissioned hosts from puppet DB? [20:02:04] herron possibly? [20:02:39] Krenair: yes deactivating the node will effectively do it [20:03:01] the record isn’t removed from puppetdb, but exported resources, etc. 
will stop [20:04:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.080 second response time [20:04:25] will it still show up to cumin querying the puppet DB API herron? [20:04:38] i.e. /pdb/query/v4/ calls [20:05:00] so resources or nodes [20:05:04] jouncebot: I'll deploy a minor ORES update during the Services window. [20:07:13] it depends on the endpoint, but cumin should do the right thing once deactivated [20:07:21] !log awight@deploy1001 Started deploy [ores/deploy@36037b6]: New badwords for ORES in English: T196468 [20:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:27] T196468: Catch a specific new badword in English - https://phabricator.wikimedia.org/T196468 [20:09:08] Krenair: node clean and node deactivate revokes the cert and remove the node from puppetdb [20:09:36] cool [20:09:41] thanks herron & volans [20:09:56] in which context do you need it? [20:10:34] I ran a cumin query using the puppet DB backend and a bunch of deleted instances came up [20:10:52] beta cluster? [20:10:54] yeah [20:11:04] ok makes sense [20:11:12] (03CR) 10Mholloway: [C: 031] "I don't have +2 on this repo, but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/439543 (owner: 10Gehel) [20:11:17] cleaned and deactivated the nodes for those now [20:11:24] and cumin doesn't see them anymore [20:11:26] so, success \o/ [20:11:34] yep :) [20:11:49] volans: fwiw node deactivate marks as deactivated, but doesn’t remove from puppetdb entirely. you can find deactivated nodes, resources, etc. 
from the root endpoint [20:12:34] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 382.78 seconds [20:13:28] (03PS14) 10Ottomata: SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [20:19:15] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:20:10] (03PS1) 10Ottomata: Re-add dummy kafka_main-eqiad_broker dummy certificates [labs/private] - 10https://gerrit.wikimedia.org/r/440193 [20:20:44] (03CR) 10Ottomata: [V: 032 C: 032] Re-add dummy kafka_main-eqiad_broker dummy certificates [labs/private] - 10https://gerrit.wikimedia.org/r/440193 (owner: 10Ottomata) [20:21:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.071 second response time [20:21:56] ORES canary is healthy, continuing with deployment... [20:22:07] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4140052 (10Ottomata) @mep [20:42:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 0.066 second response time [20:47:09] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4281301 (10mepps) @Ottomata just the LDAP login isn't working. I should be in wmf... 
[20:49:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.077 second response time [20:50:53] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4281304 (10Ottomata) @mepps can you find me on IRC (ottomata) or google chat (aotto@wikimedia.org)? No... [20:54:26] awight, did the deploy finish? [20:54:35] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.106 second response time [20:54:39] I didn't see a log line [20:56:54] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 33.98 seconds [20:56:54] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 53.70 seconds [20:57:15] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:57:25] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [20:57:35] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [20:57:44] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.80 seconds [20:57:45] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [20:57:54] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [21:00:31] (03PS13) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [21:01:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [21:03:08] (03Abandoned) 10MaxSem: Graduate CodeMirror out of 
beta on non-RTL wikis, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440179 (https://phabricator.wikimedia.org/T185030) (owner: 10MaxSem) [21:07:06] (03PS2) 10MaxSem: Graduate CodeMirror out of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) [21:10:59] (03CR) 10Gehel: [WIP] Allow multiple elasticsearch instances per host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [21:12:04] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1974 bytes in 0.186 second response time [21:12:25] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.12 seconds [21:15:58] halAFK: no, surprisingly! It's at 17/18 hosts, seems to still be running and not stalled. [21:16:04] New badness. [21:16:18] uhoh [21:16:32] actually going AFK now. WIll be on gchat/telegram. [21:16:44] halAFK: fyi it's finishing now. [21:16:51] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4281327 (10faidon) @cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let'... [21:18:44] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4281331 (10faidon) What are the symptoms? Furthermore, I noticed in one of the task PXE being mentioned. Is the issue just with PXE (e.g. not showing up in the boot order) or does it happ... [21:19:14] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 11.83 seconds [21:19:59] thcipriani: Is it possible that scap is pausing after each worker completes, and won't continue unless I hit Enter? Like, debugging code perhaps? 
[21:20:08] I'm seeing really weird behavior. [21:20:17] !log awight@deploy1001 Finished deploy [ores/deploy@36037b6]: New badwords for ORES in English: T196468 (duration: 72m 56s) [21:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:23] T196468: Catch a specific new badword in English - https://phabricator.wikimedia.org/T196468 [21:20:23] is it prompting you for input? [21:20:27] no. [21:20:44] I've never heard of that happening before [21:21:04] we don't have any debugging flag that would cause that behavior [21:21:28] thcipriani: https://phabricator.wikimedia.org/P7256 [21:22:22] I found it hung after waiting a very long time (1h15) for an ORES deployment to finish. [21:22:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.071 second response time [21:22:50] The last time I reconnected to screen, I started hitting Enter to see if the process was dead [21:23:01] it gave me line 26 in that paste [21:24:03] then, each time I hit return after that, it would print another status line. But note that L31-41 are a perfect sequence, though I waited different lengths of time. [21:25:24] * awight runs and hides [21:27:32] hrm so looking at the deploy-log "executing fetch-checks" happened at 20:29:24 then nothing in the logs until 21:16:09 [21:32:15] we do block for the last read of stderr and stdout of a process...maybe got hung there somehow, but not clear how/where exactly [21:34:53] Okay thanks, I'll chalk it up to late local time and maybe tmux fail, but will let you know if it happens again! [21:35:38] awight: where are you right now? [21:35:43] still europe? [21:36:37] greg-g: yep, UTC+2 [21:36:59] I've been transported to bicycle heaven, though. [21:37:23] amsterdam? [21:38:07] :D precisely. 
This city has a lot to love about it, like green space and things for kids to do [21:39:23] there are other green things I was told of [21:39:29] * Hauskatze hides [21:39:37] lol [21:40:55] hehe, yes the smells here remind me of my native country, People's Republic of Berkeley [21:42:01] awight 28 h 10 min flight, from £5,203, that's expensive. [21:42:42] awight: the University? [21:43:09] iirc it was in CA [21:43:15] but I'm not sure [21:43:33] (03PS1) 10DCausse: Bump extra version to 5.5.2.7 and highlighter version to 5.5.2.3 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/440249 [21:43:53] The entire town used to be known for its protests, Happenings, and high proportion of PhDs, but I'm afraid the whole thing may be an outdated fraud. [21:50:54] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [21:55:04] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.096 second response time [22:00:05] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.083 second response time [22:03:34] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [22:04:34] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [22:05:34] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [22:07:19] (03CR) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [22:07:42] (03PS14) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [22:07:45] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [22:08:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [22:09:35] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:09:54] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [22:09:55] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:10:14] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [22:10:25] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:16:24] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:18:50] (03CR) 10Gehel: [WIP] Allow multiple elasticsearch instances per host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [22:39:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1277.eqiad.wmnet are marked down but pooled [22:40:54] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180613T2300). [23:00:04] MaxSem and Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:22] o/ [23:00:39] Mine is not testable but it's better to monitor logs for a while [23:00:40] I'll do it [23:01:02] (03PS3) 10MaxSem: Graduate CodeMirror out of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) [23:01:08] (03CR) 10MaxSem: [C: 032] Graduate CodeMirror out of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) (owner: 10MaxSem) [23:02:32] (03Merged) 10jenkins-bot: Graduate CodeMirror out of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) (owner: 10MaxSem) [23:02:50] (03CR) 10jenkins-bot: Graduate CodeMirror out of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440178 (https://phabricator.wikimedia.org/T185030) (owner: 10MaxSem) [23:07:24] !log maxsem@deploy1001 Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440178/ (duration: 01m 00s) [23:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:31] (03PS3) 10MaxSem: Enable wp10 data storage in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432093 (https://phabricator.wikimedia.org/T192268) (owner: 10Ladsgroup) [23:08:35] (03CR) 10MaxSem: [C: 032] Enable wp10 data storage in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432093 (https://phabricator.wikimedia.org/T192268) (owner: 10Ladsgroup) [23:09:52] (03Merged) 10jenkins-bot: Enable wp10 data storage in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432093 (https://phabricator.wikimedia.org/T192268) (owner: 10Ladsgroup) [23:12:34] (03CR) 10jenkins-bot: Enable wp10 data storage in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432093 (https://phabricator.wikimedia.org/T192268) (owner: 10Ladsgroup) [23:15:51] !log 
maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/432093/ (duration: 00m 58s) [23:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:03] Amir1: ^ [23:16:11] Thanks [23:16:17] let's wait for five minute [23:16:21] *minutes [23:21:54] It's all fine [23:21:58] logs are clean [23:21:59] \o/ [23:25:52] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4281549 (10Jdlrobson) Reflecting reality let's move this is sprint for visibility [23:26:55] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [23:28:04] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy