[00:00:05] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190704T0000)
[00:00:07] Niharika: nice to see you too! By the way, one hates to be coarse in these joyful moments of partial blocks deployments, but I don't like partial blocks much ;-)
[00:00:28] hauskatze: I'm curious to hear why. :)
[00:00:32] nice to have 'em though
[00:00:46] Niharika: I think my line of work does not benefit from them much
[00:01:05] I mean, spambots and sock puppets ain't candidates for partial blocks
[00:01:21] we'll have to disappoint jouncebot
[00:02:10] subbu: jouncebot not displaying updated content?
[00:02:16] jouncebot: refresh
[00:02:17] I refreshed my knowledge about deployments.
[00:02:25] jouncebot: next
[00:02:25] In 23 hour(s) and 57 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190705T0000)
[00:02:30] subbu: I was thinking about making it 24 windows each an hour long just to get the point across, but… :-)
[00:02:30] jouncebot: now
[00:02:30] For the next 23 hour(s) and 57 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190704T0000)
[00:02:36] hauskatze ... deploy to fix an unbreak patch.
[00:02:44] ah, so unscheduled
[00:02:46] alright
[00:02:49] :)
[00:03:02] I've signed it off as the duty deploy manager.
[00:03:04] Or something.
[00:03:35] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy PB to wikisource, wikivoyage and wiktionary projects; T218626 (duration: 00m 50s)
[00:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:41] T218626: [Epic] Partial block rollout - https://phabricator.wikimedia.org/T218626
[00:03:45] hauskatze: Got it. Do you think they can be made more useful for your line of work?
[00:05:37] (PS2) Dzahn: icinga/elasticsearch: fix notes_link->notes_url parameter name [puppet] - https://gerrit.wikimedia.org/r/520654
[00:06:14] greg-g, fyi .. reg unscheduled deploy.
[00:07:09] Niharika: not sure, but I could think about it tomorrow; because it's late in here
[00:07:32] gn8
[00:08:59] Night!
[00:09:08] James_F I had a phabricator deploy scheduled for now
[00:09:11] :-/
[00:09:57] I guess it can wait
[00:10:06] (CR) Dzahn: "this caused an issue on icinga1001 that i saw by chance." [puppet] - https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: Jbond)
[00:12:09] (PS3) Dzahn: icinga/elasticsearch: fix notes_link->notes_url parameter name [puppet] - https://gerrit.wikimedia.org/r/520654 (https://phabricator.wikimedia.org/T183177)
[00:13:00] twentyafterfour: Oh, sorry, go for it. Won't affect prod.
[00:13:14] (CR) Dzahn: [C: +2] icinga/elasticsearch: fix notes_link->notes_url parameter name [puppet] - https://gerrit.wikimedia.org/r/520654 (https://phabricator.wikimedia.org/T183177) (owner: Dzahn)
[00:13:52] twentyafterfour: But maybe greg removed the Phab deploy window so that you could have more of a holiday? ;-)
[00:14:15] well, the phab deploy is weeks over due
[00:14:59] greg-g: around?
[00:15:02] !log cscott@deploy1001 Started deploy [parsoid/deploy@af5fd0e]: Updating Parsoid to d355bc90 (deploy-20170703 branch, T227216)
[00:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:07] T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216
[00:20:34] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Reedy)
[00:21:50] !log cscott@deploy1001 Finished deploy [parsoid/deploy@af5fd0e]: Updating Parsoid to d355bc90 (deploy-20170703 branch, T227216) (duration: 06m 48s)
[00:21:53] twentyafterfour: Just go for it.
[00:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:55] T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216
[00:25:47] (PS1) Dzahn: icinga/elasticsearch: fix notes_url->dashboard_links param name [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177)
[00:26:11] (CR) jerkins-bot: [V: -1] icinga/elasticsearch: fix notes_url->dashboard_links param name [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177) (owner: Dzahn)
[00:26:38] ^ that was an unbreaknow fix for Parsoid (T227216) that we just deployed
[00:26:48] sorry, i should really have given ops a little more warning.
[00:27:17] oh, i see that subbu did ping in here first. good. sorry, i should have done so too.
[00:27:37] cscott, all good.
[00:27:46] phabricator will be offline for just a moment for upgrade.
[00:27:52] starting in ~2 minutes
[00:27:57] !log Deploying Phabricator release/2019-07-03/1 from wmf/stable
[00:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:19] (PS2) Dzahn: icinga/elasticsearch: remove notes_url param where it does not belong [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177)
[00:30:16] (CR) Dzahn: [C: +2] icinga/elasticsearch: remove notes_url param where it does not belong [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177) (owner: Dzahn)
[00:33:13] (CR) Dzahn: [C: +2] "this fixed the puppet run on icinga1001 and applied a LOT of changes coming from the original I9f06cc7d3090" [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177) (owner: Dzahn)
[00:34:37] (CR) Dzahn: [C: +2] "also the change that gives Icinga a user agent has been applied now" [puppet] - https://gerrit.wikimedia.org/r/520656 (https://phabricator.wikimedia.org/T183177) (owner: Dzahn)
[00:35:53] (CR) Dzahn: "actually then https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520656/ and that fixed the puppet run on icinga1001 which applied a " [puppet] - https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: Jbond)
[00:36:39] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:41:01] !log phabricator upgrade complete
[00:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:57] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[00:43:35] ^ arrr. yea.. chain reaction
[00:43:42] puppet did not run because of bug 1
[00:44:01] so then i fixed that and puppet ran again. that applied the next code change which has bug 2
[00:44:23] that one is just a missing contact now that i need to create
[00:46:07] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[00:50:01] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:51:59] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[00:52:11] there's a huge "View Task" button on Phab emails now :o
[00:54:29] (CR) Dzahn: [C: +2] "adding missing contact "thcipriani" in private repo. notification method email, timezone 24x7" [puppet] - https://gerrit.wikimedia.org/r/512292 (owner: Dzahn)
[01:11:03] PROBLEM - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1437.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:12:13] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:12:19] PROBLEM - MariaDB Slave Lag: m3 on db1128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:12:21] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:19:56] PROBLEM - MariaDB Slave Lag: m3 on db1128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:21:00] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:43:40] RECOVERY - MariaDB Slave Lag: m3 on db1117 is OK: OK slave_sql_lag Replication lag: 48.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:48:49] RECOVERY - MariaDB Slave Lag: m3 on db1128 is OK: OK slave_sql_lag Replication lag: 31.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:26:04] Operations, Continuous-Integration-Infrastructure, Release Pipeline, Release-Engineering-Team-TODO (201907): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (hashar) Open→Resolved Indeed. What Daniel has said :-]
[03:31:18] (PS1) Felipe L. Ewald: Dead FTP link [dumps/html] - https://gerrit.wikimedia.org/r/520668
[03:33:44] (Abandoned) Felipe L. Ewald: Dead FTP link [dumps/html] - https://gerrit.wikimedia.org/r/520668 (owner: Felipe L. Ewald)
[04:44:05] Operations, ops-codfw, DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (Marostegui) Open→Resolved All good now - thanks! ` root@db2049:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DD260) Port Name: 1I Port N...
[04:52:03] (PS1) Marostegui: Revert "dbproxy: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/520672
[05:00:44] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (Marostegui)
[05:02:46] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:07:43] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 11474.21 seconds Marostegui T227251 - The acknowledgement expires at: 2019-07-05 05:07:16. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:07:43] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 11474.99 seconds Marostegui T227251 - The acknowledgement expires at: 2019-07-05 05:07:16. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:11:32] (PS1) Marostegui: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520673 (https://phabricator.wikimedia.org/T227062)
[05:12:57] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520673 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:13:50] (Merged) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520673 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:14:04] (CR) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520673 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:16:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101 for upgrade (duration: 00m 50s)
[05:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:29] !log Upgrade db1101 - T227062
[05:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:34] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:29:04] (PS1) Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520674
[05:30:12] (CR) Marostegui: [C: -2] "Still catching up" [puppet] - https://gerrit.wikimedia.org/r/520672 (owner: Marostegui)
[05:30:27] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520674 (owner: Marostegui)
[05:31:17] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520674 (owner: Marostegui)
[05:31:34] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520674 (owner: Marostegui)
[05:33:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 after upgrade (duration: 00m 49s)
[05:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:40] (CR) Jcrespo: "> Patch Set 1:" [software/tendril] - https://gerrit.wikimedia.org/r/520359 (owner: BryanDavis)
[05:53:33] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (mmodell) I'm cleaning up the worker queue to lighten the load. It should subside soon.
[05:57:12] !log disabled phd on phab1003 while I clean things up. Registered the downtime in icinga
[05:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:56] (PS4) Jcrespo: mariadb: Prepare core for buster [puppet] - https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224)
[06:27:58] (PS12) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[06:28:55] (CR) jerkins-bot: [V: -1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[06:35:03] (CR) Jcrespo: "@Filippo I have applied all your suggestions (thank you very much!) except the "keep prometheus socket information on zarcillo". At the mo" [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[06:36:53] (PS13) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[06:39:58] (PS1) Elukey: profile::hadoop::spark2: allow a port range for the driver [puppet] - https://gerrit.wikimedia.org/r/520683 (https://phabricator.wikimedia.org/T170826)
[06:42:09] !log update puppet compiler's facts
[06:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:38] (CR) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db (1 comment) [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[06:45:35] !log restarting archiva on archiva.wikimedia.org to pick up Java security update
[06:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:42] (CR) Muehlenhoff: [C: +1] "Fantastic \o/" [puppet] - https://gerrit.wikimedia.org/r/520683 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[06:56:04] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17221/ - looks good for a test!" [puppet] - https://gerrit.wikimedia.org/r/520683 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[06:56:22] (PS1) Marostegui: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520686
[06:57:51] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520686 (owner: Marostegui)
[06:58:44] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520686 (owner: Marostegui)
[06:59:00] (CR) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520686 (owner: Marostegui)
[06:59:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1101 after upgrade (duration: 00m 48s)
[06:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:24] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:00:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:26] (CR) Jcrespo: "Tested on core_test already." [puppet] - https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[07:03:34] (CR) Marostegui: [C: +1] "Sweeeet!" [puppet] - https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[07:06:42] (CR) Marostegui: "Cool!" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520382 (owner: Jcrespo)
[07:08:01] (PS1) Elukey: role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826)
[07:09:07] !log rebooting restbase-dev* for kernel security updates
[07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:02] (PS4) Ema: cache: reimage cp1078 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638)
[07:11:26] (PS1) Marostegui: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520690
[07:12:32] (PS2) Elukey: role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826)
[07:14:28] (CR) Marostegui: ">" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[07:15:34] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520690 (owner: Marostegui)
[07:16:10] (PS5) Ema: cache: reimage cp1078 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638)
[07:16:24] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520690 (owner: Marostegui)
[07:16:44] (CR) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520690 (owner: Marostegui)
[07:16:50] (CR) Ema: [C: +2] cache: reimage cp1078 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638) (owner: Ema)
[07:17:10] (PS3) Elukey: role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826)
[07:17:47] !log depool cp1078 and reimage as upload_ats T226638
[07:17:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1101 after upgrade (duration: 00m 49s)
[07:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:55] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638
[07:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:26] (CR) Jcrespo: "> Patch Set 5:" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[07:19:07] (PS4) Elukey: role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826)
[07:19:14] (CR) Marostegui: "> > Patch Set 5:" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[07:20:18] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1078.eqiad.wmnet'] ` The log can be found in `...
[07:20:27] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17225/an-tool1006.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[07:20:44] (PS5) Elukey: role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826)
[07:21:12] (CR) Elukey: [C: +2] "I had to follow up with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520688/, ENOCOFFEE :)" [puppet] - https://gerrit.wikimedia.org/r/520683 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[07:21:44] (CR) Elukey: [C: +2] role::analytics_test_cluster::client: fix spark2 config [puppet] - https://gerrit.wikimedia.org/r/520688 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[07:24:57] (PS1) Marostegui: db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520693
[07:26:45] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520693 (owner: Marostegui)
[07:27:35] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520693 (owner: Marostegui)
[07:27:51] (CR) jenkins-bot: db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/520693 (owner: Marostegui)
[07:28:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1101 after upgrade (duration: 00m 49s)
[07:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:53] (PS1) Vgutierrez: Release 0.18 [software/acme-chief] - https://gerrit.wikimedia.org/r/520694 (https://phabricator.wikimedia.org/T225945)
[07:36:43] (CR) Vgutierrez: [C: +2] Release 0.18 [software/acme-chief] - https://gerrit.wikimedia.org/r/520694 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:39:31] (CR) jenkins-bot: Release 0.18 [software/acme-chief] - https://gerrit.wikimedia.org/r/520694 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:40:54] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1069 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/520695 (https://phabricator.wikimedia.org/T227166)
[07:43:43] (CR) Marostegui: [C: +1] "let's merge and start testing carefully then" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520401 (owner: Jcrespo)
[07:47:51] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1069 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/520695 (https://phabricator.wikimedia.org/T227166) (owner: Marostegui)
[07:48:32] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1069 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/520695 (https://phabricator.wikimedia.org/T227166) (owner: Marostegui)
[07:48:48] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1069 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/520695 (https://phabricator.wikimedia.org/T227166) (owner: Marostegui)
[07:49:05] (PS1) Vgutierrez: acme_chief: Enforce staging time validation [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520697 (https://phabricator.wikimedia.org/T225945)
[07:49:07] (PS1) Vgutierrez: Release 0.18 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520698 (https://phabricator.wikimedia.org/T225945)
[07:49:09] (PS1) Vgutierrez: debian: Add release 0.18 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945)
[07:49:57] (CR) jerkins-bot: [V: -1] debian: Add release 0.18 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:50:13] uh
[07:50:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1069 from config as it will be decommissioned T227166 (duration: 00m 49s)
[07:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:32] T227166: decommission db1069 - https://phabricator.wikimedia.org/T227166
[07:51:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1069 from config as it will be decommissioned T227166 (duration: 00m 48s)
[07:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:01] (CR) Vgutierrez: "recheck" [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:55:31] (CR) Jcrespo: [V: +2 C: +2] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[07:55:42] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[07:55:42] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:52] (CR) Jcrespo: [V: +2 C: +2] switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520280 (owner: Jcrespo)
[07:56:00] (CR) jerkins-bot: [V: -1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[07:56:15] (CR) Jcrespo: [V: +2 C: +2] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520382 (owner: Jcrespo)
[07:56:32] (CR) Jcrespo: [V: +2 C: +2] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520401 (owner: Jcrespo)
[07:56:53] (CR) jerkins-bot: [V: -1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520401 (owner: Jcrespo)
[07:57:40] !log rebooting cumin2001 for kernel security update
[07:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:51] (CR) Vgutierrez: [C: +2] acme_chief: Enforce staging time validation [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520697 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:57:56] (CR) Vgutierrez: [C: +2] Release 0.18 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520698 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[07:58:14] (CR) Vgutierrez: [C: +2] debian: Add release 0.18 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:00:15] (CR) Alexandros Kosiaris: [C: -1] adding a buster docker base image (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520503 (owner: Fsero)
[08:00:17] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1078.eqiad.wmnet'] ` and were **ALL** successful.
[08:00:38] (Merged) jenkins-bot: acme_chief: Enforce staging time validation [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520697 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:00:42] !log rearmed keyholder on cumin2001
[08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:47] (Merged) jenkins-bot: Release 0.18 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520698 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:00:58] (Merged) jenkins-bot: debian: Add release 0.18 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:03:18] (CR) jenkins-bot: acme_chief: Enforce staging time validation [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520697 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:03:34] (CR) jenkins-bot: Release 0.18 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520698 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:03:38] (CR) jenkins-bot: debian: Add release 0.18 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/520699 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[08:08:29] !log Upgrade db2044 - T226952
[08:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:34] T226952: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952
[08:14:44] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:20:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:20:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:48] !log rebooting cumin1001 for kernel security update
[08:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:09] !log pool cp1078 w/ ATS backend T226638
[08:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:15] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638
[08:22:39] !log uploaded acme-chief 0.18 to apt.wikimedia.org (buster) - T225945
[08:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:44] T225945: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945
[08:23:45] Operations, vm-requests, cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (akosiaris) >>! In T227041#5303963, @aborrero wrote: > Some questions I have. Do we have a single ganeti hypervisor in each row? Could you...
[08:25:41] !log rearmed keyholder on cumin1001
[08:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:13] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi) @MSantos @Mathew.onipe we're moving from graphite-based varnish metrics to prometheus-based varnish metrics, I see you were amongst the au...
[08:27:55] Operations, media-storage, serviceops, Patch-For-Review, User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (fgiunchedi) Since the 1s timeout change didn't seem to have changed things, could we revert it please?
[08:29:17] !log upgrading acme-chief to version 0.18 in acme-chief test instances - T225945
[08:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:22] T225945: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945
[08:29:37] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[08:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:52] (PS2) Marostegui: Revert "dbproxy: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/520672
[08:32:55] (CR) Marostegui: Revert "dbproxy: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/520672 (owner: Marostegui)
[08:40:17] (CR) Marostegui: [C: +2] Revert "dbproxy: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/520672 (owner: Marostegui)
[08:40:40] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:41:44] !log Repool labsdb1011 - T222978
[08:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:54] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[08:43:13] (PS1) Elukey: role::statistics::explorer: add base firewall [puppet] - https://gerrit.wikimedia.org/r/520706 (https://phabricator.wikimedia.org/T170826)
[08:59:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:59:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:34] !log rebooting deploy1001 for kernel security update
[09:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:42] !log partly rearmed keyholder on deploy1001 (missing for apache2modsec)
[09:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:47] (CR) Elukey: "Moritz: whenever you have time let me know if this change makes sense or not :)" [puppet] - https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[09:09:21] (PS1) Filippo Giunchedi: Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703)
[09:10:22] (CR) jerkins-bot: [V: -1] Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[09:18:18] (PS2) Filippo Giunchedi: Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703)
[09:21:02] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) The etherpad is ready with the procedure and ready for a review. The patch is also ready for review: https://gerrit.wikimedia.org/r/#/...
[09:21:16] ACKNOWLEDGEMENT - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. Muehlenhoff Deployment keyholders rearmed, but mod2sec blocked on https://phabricator.wikimedia.org/T224887#5232493 https://wikitech.wikimedia.org/wiki/Keyholder
[09:25:09] !log uploaded scap_3.11.0-1 to {jessie,stretch,buster}-wikimedia APT - T227225
[09:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:14] T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) - https://phabricator.wikimedia.org/T227225
[09:27:11] (PS3) Filippo Giunchedi: Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703)
[09:28:59] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17228/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[09:29:13] (PS2) Filippo Giunchedi: logstash: set retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/520409 (https://phabricator.wikimedia.org/T220103)
[09:29:27] (CR) Filippo Giunchedi: [C: +2] logstash: set retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/520409 (https://phabricator.wikimedia.org/T220103) (owner: Filippo Giunchedi)
[09:36:39] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:36:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:59] !log rebooting netmon1002 for kernel security update
[09:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:52] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures.
Failed resources (up to 3 shown): Package[curl],Package[openssh-client],Package[openssh-server] [09:39:44] (03Abandoned) 10Hashar: hhvm: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/474915 (owner: 10Hashar) [09:41:25] !log rearmed keyholder on netmon1002 [09:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:13] (03PS1) 10Filippo Giunchedi: eqsin: send logs to centrallog1001 too [puppet] - 10https://gerrit.wikimedia.org/r/520713 (https://phabricator.wikimedia.org/T200706) [09:48:04] (03PS4) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) [09:48:06] (03PS1) 10Elukey: profie::analytics::cluster::client: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) [09:51:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/17230/" [puppet] - 10https://gerrit.wikimedia.org/r/520713 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [09:52:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:52:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:45] (03PS1) 10Jbond: icinga: notes_url remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/520715 [09:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:00] (03PS3) 10Hashar: zuul: log stack dump to their own file [puppet] - 10https://gerrit.wikimedia.org/r/505253 [09:53:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/505253 (owner: 10Hashar) [09:55:53] !log rolling reboot of kubestagetcd* to pick up MDS-enabled qemu [09:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:17] !log 
jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:56:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:37] (03PS2) 10Elukey: profie::analytics::cluster::client: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) [09:57:39] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational [09:58:25] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:01:07] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [10:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] PROBLEM - High lag on wdqs1003 is CRITICAL: 5411 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:03:14] (03CR) 10Fsero: adding a buster docker base image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [10:03:15] RECOVERY - Free Blazegraph allocators wdqs-blazegraph on wdqs1006 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [10:03:31] (03PS3) 10Fsero: adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 [10:04:10] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Peachey88) [10:05:35] (03CR) 10Jbond: [C: 03+2] icinga: notes_url remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/520715 (owner: 10Jbond) [10:14:26] (03PS1) 10Volans: commands: perform GC on KernelVersion objects too [software/debmonitor] - 10https://gerrit.wikimedia.org/r/520720 [10:14:33] (03CR) 10Jbond: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [10:15:44] (03PS10) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) [10:17:05] jouncebot: next [10:17:05] In 13 hour(s) and 42 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190705T0000) [10:17:17] :D [10:19:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] RESTRouter: Add initial Helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [10:21:00] (03PS1) 10Jbond: icinga: revert regression [puppet] - 10https://gerrit.wikimedia.org/r/520722 [10:21:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Single comment, rest LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [10:22:01] (03CR) 10Jbond: [C: 03+2] icinga: revert regression [puppet] - 10https://gerrit.wikimedia.org/r/520722 (owner: 10Jbond) [10:25:29] (03CR) 10Alexandros Kosiaris: [C: 
04-1] "Adding Moritz for an input on whether we will have the "thirdparty" component ever on buster. Input on backports would also be nice. Thank" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [10:32:07] (03PS12) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:32:09] (03PS1) 10Jcrespo: Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:32:35] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:32:37] (03CR) 10jerkins-bot: [V: 04-1] Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [10:33:03] jouncebot: now [10:33:03] For the next 13 hour(s) and 26 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190704T0000) [10:33:18] oh, it switches at midnight, that’s why [10:34:26] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10aborrero) OK, thanks. So I think we could have 2 or 3 VMs in the same row, but running on different hardware, which more or less serves ou... [10:35:01] (03PS2) 10Jcrespo: Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:36:11] 10Operations, 10Performance-Team: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10elukey) p:05Triage→03Normal [10:36:18] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) A few notes I gathered while comparing an Dell system (`ms-be2049`) that runs `powersave` and AFAICT has no performance issues (i.e. 
low reported cpu load, ~10%) with an HP system (`ms-be2037`... [10:37:12] (03CR) 10Fsero: "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [10:37:27] (03PS9) 10Elukey: mcrouter: allow async foreign set/delete WAN cache operations [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [10:37:46] (03CR) 10jerkins-bot: [V: 04-1] Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [10:39:23] (03PS3) 10Jcrespo: Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:39:51] (03CR) 10jerkins-bot: [V: 04-1] Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [10:40:46] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/520720 (owner: 10Volans) [10:41:25] (03PS4) 10Jcrespo: Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:41:51] (03CR) 10jerkins-bot: [V: 04-1] Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [10:43:17] (03CR) 10Volans: [C: 03+2] commands: perform GC on KernelVersion objects too [software/debmonitor] - 10https://gerrit.wikimedia.org/r/520720 (owner: 10Volans) [10:45:29] (03Merged) 10jenkins-bot: commands: perform GC on KernelVersion objects too [software/debmonitor] - 10https://gerrit.wikimedia.org/r/520720 (owner: 10Volans) [10:45:38] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Marostegui) @mmodell there has not been any significant change to the amount of INSERTs the master is getting https://grafana.wikimedia.org/d/000000273/mysql?... 
[10:45:44] (03PS10) 10Elukey: mcrouter: allow async foreign set/delete WAN cache operations [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [10:45:46] (03PS1) 10Elukey: Enable mcrouter async replication to codfw on mw1261 and mw1276 [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) [10:46:04] (03PS5) 10Jcrespo: Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:46:42] (03CR) 10jerkins-bot: [V: 04-1] Fix style problems on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [10:47:29] !log Ease replication consistency option on db2065 to allow it to catch a bit - T227251 [10:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:35] T227251: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 [10:47:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [10:48:53] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Marostegui) >>! In T227251#5306333, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/A... 
[10:49:04] (03PS6) 10Jcrespo: Fix style and unit tests on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 [10:50:31] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17232/" [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [10:53:02] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) This is what the new config should look like (from the puppet compiler): ` { "pools": {... [11:01:11] (03PS1) 10Effie Mouzeli: Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 [11:02:40] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10MoritzMuehlenhoff) p:05Triage→03Normal Hi David, this needs to be signed off by your manager, can you please ask him/her to confirm on this task? 
[11:08:09] RECOVERY - MariaDB Slave Lag: m3 on db2065 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:15:05] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:21:34] (03PS1) 10Vgutierrez: Add ncredir[12]001 DNS entries [dns] - 10https://gerrit.wikimedia.org/r/520729 (https://phabricator.wikimedia.org/T133548) [11:25:59] (03PS1) 10Muehlenhoff: Add buster to debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/520730 [11:27:58] (03CR) 10Muehlenhoff: [C: 03+2] Add buster to debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/520730 (owner: 10Muehlenhoff) [11:29:49] jouncebot next [11:29:49] In 12 hour(s) and 30 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190705T0000) [11:29:54] (03CR) 10Ema: [C: 03+1] Add ncredir[12]001 DNS entries [dns] - 10https://gerrit.wikimedia.org/r/520729 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:30:11] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1068 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:30:19] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir[12]001 DNS entries [dns] - 10https://gerrit.wikimedia.org/r/520729 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:34:01] (03CR) 10Ema: [C: 03+1] eqsin: send logs to centrallog1001 too [puppet] - 10https://gerrit.wikimedia.org/r/520713 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [11:34:59] (03CR) 10Hashar: [C: 03+1] "Should be good. 
The puppet compiler does not show any difference since it never compare the content of files (the git diff is enough for t" [puppet] - 10https://gerrit.wikimedia.org/r/505253 (owner: 10Hashar) [11:35:37] (03PS1) 10Jcrespo: mariadb: Depool db1109 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520732 (https://phabricator.wikimedia.org/T227062) [11:36:55] PROBLEM - Check whether ferm is active by checking the default input chain on netmon1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:37:27] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:37:43] looking [11:38:21] RECOVERY - Check whether ferm is active by checking the default input chain on netmon1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:38:51] !log upgraded scap to 3.11.0-1 on A:mw-canary - T227225 [11:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:56] T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) - https://phabricator.wikimedia.org/T227225 [11:42:45] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520732 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [11:43:29] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1109 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520732 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [11:44:25] (03Merged) 10jenkins-bot: mariadb: Depool db1109 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520732 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [11:46:32] (03CR) 10jenkins-bot: mariadb: Depool db1109 for upgrade [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/520732 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [11:47:06] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 for upgrade (duration: 00m 45s) [11:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:20] * volans watching [11:53:34] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 for upgrade (duration: 00m 50s) [11:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:45] 10Operations, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is all complete with the decom of labstore1003 [11:56:55] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10ema) 05Open→03Stalled >>! In T213769#5170476, @BBlack wrote: > Yeah, it's mostly just blocked on us making some time to deal with it, and time has been in extremely short supply lately,...
[11:57:53] 10Operations, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10ema) [11:59:20] !log stop and upgrade db1109 T227062 [11:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:29] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 [12:01:34] !log upgrading buster installations to final frozen package state [12:01:36] 10Operations: keyholder: continue to arm keys if one fails - https://phabricator.wikimedia.org/T227272 (10Volans) [12:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:45] moritzm: FYI I've opened ^^ [12:02:03] 10Operations: keyholder: continue to arm keys if one fails - https://phabricator.wikimedia.org/T227272 (10Volans) p:05Triage→03Normal [12:03:53] ack, good idea [12:05:05] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520737 [12:06:56] (03PS1) 10Hashar: contint1001: point Docker data to a different partition [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) [12:07:39] (03PS2) 10Hashar: contint1001: point Docker data to a different partition [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) [12:07:51] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [12:08:56] (03CR) 10Hashar: "In short:" [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [12:09:13] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10MoritzMuehlenhoff) [12:09:15] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure 
for Debian buster - https://phabricator.wikimedia.org/T213546 (10MoritzMuehlenhoff) 05Open→03Resolved We have 22 servers running Buster at this point and Puppet is working well for us. [12:12:20] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) @aaron sorry for the delay in working on this! Before starting, I'd like to make a plan ab... [12:12:46] (03CR) 10Hashar: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/212/" [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [12:14:57] (03PS1) 10Jcrespo: mariadb: Repool db1109 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520739 (https://phabricator.wikimedia.org/T227062) [12:16:40] (03CR) 10Ema: "Wow this is really a ton of new stuff: we're doubling the number of services on lvs1014 from 12 to 24. @BBlack is this known and OK?" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [12:16:49] (03PS5) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) [12:19:26] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:19:40] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db1109 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520739 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [12:20:37] (03Merged) 10jenkins-bot: mariadb: Repool db1109 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520739 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [12:20:52] (03CR) 10jenkins-bot: mariadb: Repool db1109 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520739 (https://phabricator.wikimedia.org/T227062) (owner: 10Jcrespo) [12:20:59] !log Started a Wikidata JSON dump run (sudo -b -u dumpsgen /usr/local/bin/dumpwikidatajson.sh) on snapshot1008 (T227207) [12:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:08] T227207: Wikibase JSON output (dumps, Special:EntityData) lacks qualifier hashes - https://phabricator.wikimedia.org/T227207 [12:21:15] apergos: ^ FYI [12:21:41] that will finish sometime Saturday I guess [12:21:48] ok, thanks for the heads up [12:21:57] presumably, yes [12:22:29] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520737 [12:24:24] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 with low weight (duration: 00m 49s) [12:24:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:07] 10Operations, 10Wikimedia-SVG-rendering: Install (currently non-existing) Debian packages for PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181 (10MoritzMuehlenhoff) The Paratype font is now packaged in Debian (although not in the version we run on the image scalers/Thumbor, but we... [12:26:55] (03CR) 10Muehlenhoff: "The approach seems sensible, but needs the respective keytab first." [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:27:10] (03PS3) 10Muehlenhoff: Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411 [12:31:18] (03PS2) 10Effie Mouzeli: Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) [12:31:45] (03CR) 10jerkins-bot: [V: 04-1] Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) (owner: 10Effie Mouzeli) [12:32:02] (03CR) 10Elukey: "Yep I'll deploy it before turning on the kerberos flag.." 
[puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:32:22] (03PS3) 10Elukey: profie::analytics::cluster::client: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) [12:32:35] (03CR) 10Jcrespo: [C: 03+1] Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520737 (owner: 10Jcrespo) [12:33:19] (03PS3) 10Effie Mouzeli: Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) [12:34:03] (03PS1) 10Ema: cache: reimage cp1080 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520741 (https://phabricator.wikimedia.org/T226638) [12:34:53] (03PS2) 10Filippo Giunchedi: eqsin: send logs to centrallog1001 too [puppet] - 10https://gerrit.wikimedia.org/r/520713 (https://phabricator.wikimedia.org/T200706) [12:35:37] (03CR) 10Filippo Giunchedi: [C: 03+2] eqsin: send logs to centrallog1001 too [puppet] - 10https://gerrit.wikimedia.org/r/520713 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [12:36:22] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17237/" [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:36:28] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T227076 (10jijiki) I will update its firmware (like others in T141756) and see what happens [12:36:31] (03PS4) 10Elukey: profie::analytics::cluster::client: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/520714 (https://phabricator.wikimedia.org/T226698) [12:37:23] 10Operations, 10ops-eqiad, 10serviceops: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 (10jijiki) [12:39:45] !log depool cp1080 and reimage as upload_ats
T226638 [12:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:51] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [12:40:26] !log upgraded scap to 3.11.0-1 on deploy[12]001 - T227225 [12:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:33] T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) - https://phabricator.wikimedia.org/T227225 [12:40:57] (03CR) 10Ema: [C: 03+2] cache: reimage cp1080 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520741 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [12:41:04] (03PS2) 10Ema: cache: reimage cp1080 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520741 (https://phabricator.wikimedia.org/T226638) [12:42:02] (03PS1) 10Elukey: profile::analytics::cluster::client: enable krb auth in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/520745 (https://phabricator.wikimedia.org/T226698) [12:42:10] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Gehel) 05Open→03Resolved All nodes reimaged, we're good for the moment [12:42:56] !log Restore defaults replication consistency options on db2065 - T227251 [12:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:06] T227251: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 [12:43:58] (03PS2) 10Elukey: profile::analytics::cluster::client: enable krb auth in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/520745 (https://phabricator.wikimedia.org/T226698) [12:44:01] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Marostegui) I have restored the defaults after db2065 caught up [12:44:36] 
10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1080.eqiad.wmnet'] ` The log can be found in `... [12:44:41] (03PS1) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [12:44:43] (03PS1) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [12:44:52] (03PS3) 10Elukey: profile::analytics::cluster::client: enable krb auth in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/520745 (https://phabricator.wikimedia.org/T226698) [12:45:37] (03CR) 10jerkins-bot: [V: 04-1] monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 (owner: 10Jbond) [12:45:55] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [12:46:03] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::client: enable krb auth in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/520745 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:46:57] (03PS2) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [12:47:48] (03CR) 10jerkins-bot: [V: 04-1] monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 (owner: 10Jbond) [12:48:04] (03PS1) 10Jcrespo: Normalize package name among all tools [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520749 [12:49:34] (03PS1) 10Hashar: Output summary of build errors upon completion [docker-images/docker-pkg] - 
10https://gerrit.wikimedia.org/r/520751 [12:50:10] (03CR) 10Hashar: [C: 04-1] "WIP. Will have to test it a bit more. In short it is easy to miss a build failure when logging is send to stdout :(" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/520751 (owner: 10Hashar) [12:52:28] (03PS1) 10Elukey: profile::analytics::cluster::client: fix sudo::user declaration [puppet] - 10https://gerrit.wikimedia.org/r/520752 [12:53:57] (03PS3) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [12:53:59] (03PS2) 10Jcrespo: Normalize package name among all tools [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520749 [12:54:03] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::client: fix sudo::user declaration [puppet] - 10https://gerrit.wikimedia.org/r/520752 (owner: 10Elukey) [12:54:22] PROBLEM - puppet last run on an-tool1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[12:56:43] (03PS4) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [12:57:23] (03PS2) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [12:57:34] (03CR) 10jerkins-bot: [V: 04-1] monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 (owner: 10Jbond) [12:58:09] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [12:59:32] RECOVERY - puppet last run on an-tool1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:01:23] (03PS5) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [13:02:16] (03CR) 10jerkins-bot: [V: 04-1] monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 (owner: 10Jbond) [13:02:22] (03PS3) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [13:02:59] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:03:16] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10MoritzMuehlenhoff) There was a pre-existing bugreport in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918250 I've followed up there along with the offer to help get that fix... 
[13:05:58] !log upgraded scap to 3.11.0-1 on A:codfw - T227225 [13:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:03] T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) - https://phabricator.wikimedia.org/T227225 [13:07:33] (03PS6) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [13:15:03] !log reboot ms-be2037 after setting "os control" for power regulator mode - T225713 [13:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:08] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [13:16:09] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) According to this patch https://patchwork.kernel.org/patch/10530095/ the interface used by `pcc-cpufreq` isn't scalable with many (>4) CPUs and shouldn't be used. AFAICT that patch isn't inclu... 
[13:21:45] (03PS1) 10Urbanecm: Add dotfiles for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/520753 [13:23:50] !log upgraded scap to 3.11.0-1 on A:eqiad - T227225 [13:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:56] T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) - https://phabricator.wikimedia.org/T227225 [13:27:41] (03PS4) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [13:28:41] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:28:44] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Volans) The blocker above has been fixed in scap, released and rolled out to the fleet. I'll proceed with the removal of python-conftool. [13:31:02] (03CR) 10Jbond: "Please check the notes_link for services you know about and let me know if there are better ones to use" [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:31:43] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1080.eqiad.wmnet'] ` and were **ALL** successful. 
[13:39:19] (03CR) 10Jcrespo: [C: 03+2] Fix style and unit tests on several scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520724 (owner: 10Jcrespo) [13:40:14] (03CR) 10Jcrespo: [C: 03+2] Normalize package name among all tools [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520749 (owner: 10Jcrespo) [13:40:50] !log pool cp1080 w/ ATS backend T226638 [13:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [13:43:10] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520753 (owner: 10Urbanecm) [13:46:00] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) Setting "os control" does indeed disable loading of `pcc-cpufreq` and governors now are the same as the dell host (i.e. linux drives the CPU p-states autonomously) ` ms-be2037:~$ cat /sys/dev... 
[13:51:26] !log removing python-conftool (old py2 version) from all hosts - T226965 [13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:30] T226965: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 [13:53:09] (03PS5) 10Gehel: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:53:35] (03PS1) 10Jbond: icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 [13:54:07] (03CR) 10Gehel: [C: 03+1] "+1 for maps, elasticsearch and wdqs, no idea about the rest" [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:54:12] (03CR) 10jerkins-bot: [V: 04-1] icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 (owner: 10Jbond) [13:54:18] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [13:57:26] (03PS2) 10Jbond: icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 [13:58:13] (03CR) 10jerkins-bot: [V: 04-1] icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 (owner: 10Jbond) [13:59:46] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Volans) >>! In T226965#5295407, @Joe wrote: > What we need to do is: > > [x] Upgrade python3-etcd to the latest version > [x] Upgrade python3-conftool to the latest version > [x]... 
[14:00:02] (03PS6) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [14:00:27] (03PS3) 10Jbond: icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 [14:00:56] (03CR) 10jerkins-bot: [V: 04-1] monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [14:01:04] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Volans) I assume that this host will be reimaged, but in case it's not, please manually run: ` apt-get remove python-conftool ` once fixed. [14:01:25] (03CR) 10jerkins-bot: [V: 04-1] icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 (owner: 10Jbond) [14:01:49] 10Operations, 10Goal: TEC6: Database Automation - https://phabricator.wikimedia.org/T220395 (10Volans) [14:01:51] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Volans) 05Open→03Resolved [14:05:29] (03CR) 10Jbond: [C: 03+2] Add dotfiles for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/520753 (owner: 10Urbanecm) [14:08:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) (owner: 10Effie Mouzeli) [14:10:09] (03PS4) 10Muehlenhoff: Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411 [14:10:12] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/urbanecm/.gitconfig] [14:10:31] jbond42, seems to be caused by the puppet change? 
[14:11:06] if it's just a few hosts it's a race condition in puppet when adding new files in the way we do it for the homes [14:11:13] Urbanecm: checking but I suspect it is just a transient issue as the files got deployed to the various puppet masters [14:11:21] Urbanecm: I'm re-running puppet there, likely just a blip/race [14:11:30] thanks everyone [14:11:44] yeah, all fine. it's just Puppet being Puppet :-) [14:11:51] thanks for keeping an eye on your changes ;) [14:11:53] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10MSantos) @fgiunchedi overall it looks good, just have one question. In the Varnish response time graph, do you know why eqiad p99 values are so differ... [14:12:07] (03CR) 10Muehlenhoff: [C: 03+2] Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411 (owner: 10Muehlenhoff) [14:12:09] well, icinga pinged me since the message contained my username :) [14:12:31] automation by mistake :D [14:12:48] (03PS7) 10Jbond: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) [14:13:39] (03PS4) 10Jbond: icinga icon: Use correct icon for notes_url [puppet] - 10https://gerrit.wikimedia.org/r/520756 [14:14:57] (03PS1) 10Filippo Giunchedi: Enable centrallog1001 on all pops [puppet] - 10https://gerrit.wikimedia.org/r/520761 (https://phabricator.wikimedia.org/T200706) [14:15:34] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:22:36] (03PS1) 10Ema: cache: reimage cp1082 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520762 (https://phabricator.wikimedia.org/T226638) [14:24:43] (03PS1) 10Muehlenhoff: Update a number of comments still referring to Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520764 [14:27:37] (03CR) 10Ema: [C: 03+2] cache: reimage 
cp1082 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520762 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [14:27:42] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) >>! In T184942#5306825, @MSantos wrote: > @fgiunchedi overall it looks good, just have one question. In the Varnish response time graph, d... [14:28:06] (03PS2) 10Filippo Giunchedi: Enable centrallog1001 on all pops [puppet] - 10https://gerrit.wikimedia.org/r/520761 (https://phabricator.wikimedia.org/T200706) [14:28:08] !log depool cp1080 and reimage as upload_ats T226638 [14:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:14] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [14:28:43] (03CR) 10Muehlenhoff: adding a buster docker base image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [14:29:02] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1082.eqiad.wmnet'] ` The log can be found in `... 
[14:30:00] (03CR) 10Ema: [C: 03+1] Enable centrallog1001 on all pops [puppet] - 10https://gerrit.wikimedia.org/r/520761 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [14:31:38] (03CR) 10Filippo Giunchedi: [C: 03+2] Enable centrallog1001 on all pops [puppet] - 10https://gerrit.wikimedia.org/r/520761 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [14:31:46] (03PS3) 10Filippo Giunchedi: Enable centrallog1001 on all pops [puppet] - 10https://gerrit.wikimedia.org/r/520761 (https://phabricator.wikimedia.org/T200706) [14:36:14] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10mmodell) @Marostegui: The phabricator work queue is almost empty now, see https://phabricator.wikimedia.org/daemon/ (There were well over 1 million jobs, now... [14:37:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/520764 (owner: 10Muehlenhoff) [14:38:15] (03PS1) 10Muehlenhoff: Remove support for Ubuntu from os_version and related tests [puppet] - 10https://gerrit.wikimedia.org/r/520765 [14:38:30] 10Operations, 10Analytics, 10Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (10ema) 05Open→03Resolved a:03ema The maximum allowed request header size (field name + value) is now 8192 bytes. Closing. [14:38:36] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Marostegui) From what I can see now, the UPDATEs have stopped, but the INSERT rate is still at the same level on the master: https://grafana.wikimedia.org/d/0... 
[14:39:29] (03CR) 10jerkins-bot: [V: 04-1] Remove support for Ubuntu from os_version and related tests [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [14:41:12] (03PS2) 10Muehlenhoff: Update a number of comments still referring to Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520764 [14:43:45] (03CR) 10Muehlenhoff: [C: 03+2] Update a number of comments still referring to Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520764 (owner: 10Muehlenhoff) [14:44:08] (03PS1) 10Elukey: Add Ipv6 PTR/AAA records for an-worker* [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) [14:44:58] (03PS2) 10Elukey: Add Ipv6 PTR/AAA records for an-worker* [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) [14:48:40] (03PS3) 10Elukey: Add Ipv6 PTR/AAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) [14:49:18] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10mmodell) I could cancel the rest of the search jobs, I think that would still produce quite a bit of database activity but maybe less than all the queue statu... 
[14:50:55] (03PS4) 10Fsero: adding a buster docker base image Updated: addresed comments from Alex and Moritz Change-Id: I5b7e5a0398ff347f3b56b66df5895e7a5aef0332 [puppet] - 10https://gerrit.wikimedia.org/r/520503 [14:51:20] (03CR) 10jerkins-bot: [V: 04-1] adding a buster docker base image Updated: addresed comments from Alex and Moritz Change-Id: I5b7e5a0398ff347f3b56b66df5895e7a5aef0332 [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [14:51:37] !log phabricator: lowered phd.taskmasters config to 1 from 10 [14:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:55] (03PS5) 10Fsero: adding a buster docker base image Updated: addresed comments from Alex and Moritz [puppet] - 10https://gerrit.wikimedia.org/r/520503 [14:52:20] (03CR) 10jerkins-bot: [V: 04-1] adding a buster docker base image Updated: addresed comments from Alex and Moritz [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [14:53:14] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10mmodell) @Marostegui ok I found a way to slow down the queue: I lowered `phd.taskmasters` to 1 [14:54:12] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10mmodell) Now the graphs look better. 
Unfortunately, puppet will set the config back to 10 taskmasters unless we make a commit to {rOPUP} [14:55:27] (03CR) 10Filippo Giunchedi: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:55:36] (03PS6) 10Fsero: adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 [14:55:39] (03PS1) 10Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520768 [14:56:12] (03CR) 10jerkins-bot: [V: 04-1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520768 (owner: 10Jcrespo) [14:56:31] (03PS1) 10DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520769 [14:57:11] (03CR) 10Fsero: "Alex, Moritz I've changed it as you suggested. let me know how it looks and if looks good i'll merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [14:58:34] (03PS1) 1020after4: Phabricator: Set taskmasters to 4 [puppet] - 10https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) [14:59:16] (03CR) 1020after4: [C: 03+1] Phabricator: Set taskmasters to 4 [puppet] - 10https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) (owner: 1020after4) [14:59:20] (03CR) 10Fsero: [C: 03+2] If guard releases stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 (owner: 10Alexandros Kosiaris) [14:59:24] (03CR) 10Fsero: [V: 03+2 C: 03+2] If guard releases stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 (owner: 10Alexandros Kosiaris) [14:59:33] (03PS2) 10Fsero: If guard releases stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 (owner: 10Alexandros Kosiaris) [14:59:56] 10Operations, 10Phabricator, 10Patch-For-Review: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10mmodell) p:05Triage→03High [15:00:06] (03CR) 10Fsero: [V: 03+2 C: 03+2] Update admin/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/520415 (owner: 10Alexandros Kosiaris) [15:00:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "The WMCS/labstore parts LGTM. No idea about the others." 
[puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [15:00:46] (03CR) 10Fsero: [V: 03+2 C: 03+2] If guard releases stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 (owner: 10Alexandros Kosiaris) [15:02:43] (03PS1) 10Hashar: Add image metadata to Jinja context [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/520772 [15:09:03] (03PS1) 10Muehlenhoff: prometheus-snmp-exporter: Switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724) [15:17:20] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1082.eqiad.wmnet'] ` and were **ALL** successful. [15:17:38] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus-snmp-exporter: Switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724) (owner: 10Muehlenhoff) [15:18:06] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:20:47] (03PS1) 10Muehlenhoff: kmod::blacklist: Cleanup initramfs trigger [puppet] - 10https://gerrit.wikimedia.org/r/520774 [15:22:08] !log pool cp1082 w/ ATS backend T226638 [15:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:14] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [15:24:21] (03PS1) 10Elukey: geoip::data::archive: move to kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/520775 (https://phabricator.wikimedia.org/T226698) [15:31:42] (03PS1) 10Jbond: autid_hiera: some minor style changes [labs/private] - 10https://gerrit.wikimedia.org/r/520776 [15:32:46] (03PS1) 10Ema: cache: reimage cp1084 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520777 (https://phabricator.wikimedia.org/T226638) [15:34:40] (03PS1) 10Muehlenhoff: apache::mod_conf: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520778 [15:34:53] !log depool cp1084 and reimage as upload_ats T226638 [15:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:59] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [15:35:51] (03CR) 10Ema: [C: 03+2] cache: reimage cp1084 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520777 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [15:36:06] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/520776 (owner: 10Jbond) [15:36:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520774 (owner: 10Muehlenhoff) [15:36:23] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) [15:37:16] 
10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1084.eqiad.wmnet'] ` The log can be found in `... [15:37:19] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) [15:41:20] 10Operations, 10VisualEditor: Something went wrong HTTP 404 when using Visual Editor - https://phabricator.wikimedia.org/T224384 (10Der_Keks) [15:43:29] (03PS1) 10Matěj Suchánek: Disable Wikidata for ProofreadPage namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520780 (https://phabricator.wikimedia.org/T227201) [15:51:36] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520780 (https://phabricator.wikimedia.org/T227201) (owner: 10Matěj Suchánek) [15:54:23] (03PS2) 10Jbond: autid_hiera: some minor style changes [labs/private] - 10https://gerrit.wikimedia.org/r/520776 [15:55:22] (03CR) 10Jbond: autid_hiera: some minor style changes (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/520776 (owner: 10Jbond) [15:56:36] (03PS1) 10Gehel: cloudelastic: use the proper check for SSL certificates [puppet] - 10https://gerrit.wikimedia.org/r/520782 [15:59:38] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520782 (owner: 10Gehel) [16:34:12] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1084.eqiad.wmnet'] ` and were **ALL** successful. 
[16:36:37] !log pool cp1084 w/ ATS backend T226638 [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:43] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [16:37:14] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10aborrero) For the next toolforge kubernetes cluster we have 2 approaches for etcd: * the one developed in this phabricator task: {T226098}. That one uses p... [16:41:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [16:46:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [17:29:07] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 50 probes of 433 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:34:33] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 20 probes of 433 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:01:44] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10MoritzMuehlenhoff) Should these really be both in eqiad? The initial use case is for analytics, but we might very well come up with a use case outsi... 
[18:45:19] (03PS11) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) [18:55:15] (03PS12) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) [20:43:01] (03PS3) 10Krinkle: icinga: set Reply-To header to email notifications [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [20:43:23] moritzm: if you have a minute, still waiting for the above icinga email fix :) [20:56:44] (03PS13) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) [21:15:07] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% [21:21:25] gehel around by any chance? cannot ssh or ping, nothing on console. I can reboot it but not sure if it's safe to let it reenter the cluster without checks [21:23:14] according to https://wikitech.wikimedia.org/wiki/Service_restarts#Elasticsearch it should DTRT [21:26:23] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Volans) [21:31:32] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Volans) p:05Triage→03High The host is part of the main and psi clusters: ` $ confctl --quiet select name="elastic2054.codfw.wmnet" get {"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=... [21:31:33] volans: should not be an issue [21:32:06] gehel: ack, do you prefer to have it depooled first? 
[21:33:03] Ideally yes, but even that should be a non issue [21:33:28] If LVS sees it up, it should be ready to serve requests [21:33:44] And it is codfw, no user traffic [21:34:03] ok leaving it pooled [21:34:04] I can be at my keyboard in 5' [21:34:11] no need [21:34:55] Any idea what happened to that server? [21:35:00] !log forcing reboot of elastic2054 from console, host unresponsive - T227298 [21:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:06] T227298: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 [21:35:12] not yet, totally unreachable (ping, ssh, empty console) [21:35:34] will have a look at logs [21:35:51] it's rebooting now (BIOS) [21:36:40] OS booting [21:37:01] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [21:37:02] (03PS1) 10Volans: Updated src to v0.1.10 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/520813 [21:37:33] host up, I'm in [21:38:09] I'll have to update the service restart doc for elastic now that we have those nice cookbooks! [21:38:13] volans: thanks a lot! [21:38:26] ah right :) [21:41:01] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Volans) It first detected a CPU error and then a memory one, here the hardware logs: ` ------------------------------------------------------------------------------- Record: 4 Date/Time: 07/04/2019 21:12:... [21:42:03] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Volans) Both clusters back to green: ` elastic2054 0 ~$ curl -s localhost:9600/_cluster/health?pretty { "cluster_name" : "production-search-psi-codfw", "status" : "green", "timed_out" : false, "number_o... 
[21:43:30] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Volans) p:05High→03Normal @Gehel I'll leave the task open if you want to investigate more tomorrow for potential hardware parts to replace. (see above for hardware logs). [21:46:18] (03CR) 10Volans: [V: 03+2 C: 03+2] Updated src to v0.1.10 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/520813 (owner: 10Volans) [21:50:02] !log volans@deploy1001 Started deploy [debmonitor/deploy@0ee26a3]: Deploy Debmonitor v0.1.10 [21:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:50] !log volans@deploy1001 Finished deploy [debmonitor/deploy@0ee26a3]: Deploy Debmonitor v0.1.10 (duration: 00m 48s) [21:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:24] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100%