[00:33:15] (03PS6) 10Krinkle: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [00:33:21] (03PS4) 10Krinkle: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 (https://phabricator.wikimedia.org/T203589) (owner: 10Jforrester) [00:33:26] (03PS9) 10Krinkle: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [01:40:16] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 62.08 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [03:11:10] 10Operations, 10Core-Platform-Team, 10HHVM, 10PHP 7.0 support, 10User-ArielGlenn: Run all jobs on PHP7 - https://phabricator.wikimedia.org/T195392 (10Shizhao) [03:52:03] 10Operations, 10Core-Platform-Team, 10User-ArielGlenn: Run all jobs on PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) @Shizhao This is a tracking task for migrating the WMF JobQueue jobrunner infrastructure to PHP 7. The PHP7.0-support project is for tracking known problems with code that does... [04:09:06] (03PS1) 10Krinkle: openstack: Add redirect for /view/ on Wikitech [puppet] - 10https://gerrit.wikimedia.org/r/460467 (https://phabricator.wikimedia.org/T193848) [04:53:34] (03CR) 10Krinkle: "Untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/460467 (https://phabricator.wikimedia.org/T193848) (owner: 10Krinkle) [05:03:42] (03CR) 10Marostegui: [C: 032] mariadb: Repool db2068 with load load after recloning it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460390 (https://phabricator.wikimedia.org/T204127) (owner: 10Jcrespo) [05:05:14] (03Merged) 10jenkins-bot: mariadb: Repool db2068 with load load after recloning it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460390 (https://phabricator.wikimedia.org/T204127) (owner: 10Jcrespo) [05:06:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Increase weight for db2068 - T204127 (duration: 00m 52s) [05:06:38] (03PS1) 10Marostegui: db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460469 [05:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:46] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [05:07:51] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460469 (owner: 10Marostegui) [05:09:12] (03Merged) 10jenkins-bot: db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460469 (owner: 10Marostegui) [05:10:18] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 50s) [05:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:36] (03PS1) 10Marostegui: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460470 (https://phabricator.wikimedia.org/T189101) [05:13:48] (03CR) 10jenkins-bot: mariadb: Repool db2068 with load load after recloning it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460390 (https://phabricator.wikimedia.org/T204127) (owner: 10Jcrespo) [05:13:50] (03CR) 10jenkins-bot: db-codfw.php: Increase weight for db2054 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/460469 (owner: 10Marostegui) [05:14:22] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460470 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [05:15:44] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460470 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [05:16:51] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2050 - T189101 (duration: 00m 49s) [05:16:54] !log Stop replication in sync on db1075 and db2050 - T189101 [05:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:59] T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [05:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:12] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460471 [05:26:17] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460471 (owner: 10Marostegui) [05:27:28] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [05:28:02] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460471 (owner: 10Marostegui) [05:28:21] (03CR) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460470 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [05:28:23] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460471 (owner: 10Marostegui) [05:29:02] !log marostegui@deploy1001 Synchronized 
wmf-config/db-codfw.php: Repool db2050 - T189101 (duration: 00m 49s) [05:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:11] T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [05:32:05] (03PS1) 10Marostegui: db1075.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/460472 [05:34:35] !log Deploy schema change on s3 eqiad master - T187089 [05:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:42] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:31:26] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:44:37] (03PS2) 10Elukey: profile::analytics::database::meta: use the same prod config in labs [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) [06:46:57] (03Abandoned) 10Marostegui: db1075.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/460472 (owner: 10Marostegui) [06:49:37] (03PS1) 10Banyek: db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460476 [06:49:41] (03CR) 10Elukey: "Removed the .production part, and pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/12452/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [06:50:00] (03PS2) 10Muehlenhoff: mediawiki: Clean up php7 package list [puppet] - 10https://gerrit.wikimedia.org/r/459881 (owner: 10Legoktm) [06:52:28] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1075 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/460389 (https://phabricator.wikimedia.org/T148507) [06:54:42] (03CR) 10Jcrespo: 
[C: 032] mariadb: Reenable notifications on db1075 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/460389 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [06:56:56] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:42] (03CR) 10Marostegui: [C: 031] db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460476 (owner: 10Banyek) [07:00:55] (03CR) 10Jcrespo: "All ok with any changes in this way- no blocker from us." [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [07:05:00] jynus: thanks! [07:05:22] (03CR) 10Banyek: [C: 032] db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460476 (owner: 10Banyek) [07:05:49] (03CR) 10Banyek: [V: 032 C: 032] db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460476 (owner: 10Banyek) [07:07:10] (03PS3) 10Elukey: profile::analytics::database::meta: use the same prod config in labs [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) [07:08:34] (03CR) 10Elukey: [C: 032] profile::analytics::database::meta: use the same prod config in labs [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [07:11:06] (03CR) 10jenkins-bot: db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460476 (owner: 10Banyek) [07:11:09] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s) [07:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:17] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [07:12:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production 
access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10MoritzMuehlenhoff) a:05Kalliope>03ayounsi [07:13:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10MoritzMuehlenhoff) 05Open>03Resolved Closing the task, please reopen if it doesn't work for you. [07:18:53] (03CR) 10Muehlenhoff: [C: 032] mediawiki: Clean up php7 package list [puppet] - 10https://gerrit.wikimedia.org/r/459881 (owner: 10Legoktm) [07:19:00] (03PS3) 10Muehlenhoff: mediawiki: Clean up php7 package list [puppet] - 10https://gerrit.wikimedia.org/r/459881 (owner: 10Legoktm) [07:19:31] (03PS1) 10Jcrespo: mariadb: Prepare db1062 master for reimage [puppet] - 10https://gerrit.wikimedia.org/r/460478 (https://phabricator.wikimedia.org/T148507) [07:21:39] (03PS2) 10Jcrespo: mariadb: Prepare db1062 master for reimage [puppet] - 10https://gerrit.wikimedia.org/r/460478 (https://phabricator.wikimedia.org/T148507) [07:22:05] (03PS1) 10Jcrespo: mariadb: Reenable db1062 notifications after reimage [puppet] - 10https://gerrit.wikimedia.org/r/460480 [07:22:44] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare db1062 master for reimage [puppet] - 10https://gerrit.wikimedia.org/r/460478 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [07:28:42] !log stopping db1062 mariadb in preparation for reimage [07:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:41] !log rebooting mwmaint1001 for kernel security update [07:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:20] !log Deploy schema change on s4 eqiad master (db1068) - T67448 T114117 T51191 [07:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:31] T114117: Drop externallinks.el_from_namespace on wmf databases - 
https://phabricator.wikimedia.org/T114117 [07:49:32] T51191: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 [07:49:33] T67448: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 [08:02:10] 10Operations, 10ops-eqiad: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 (10jcrespo) p:05Triage>03High [08:05:38] (03PS1) 10Banyek: db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460484 [08:07:40] (03CR) 10Marostegui: [C: 031] db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460484 (owner: 10Banyek) [08:07:44] 10Operations, 10ops-eqiad: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 (10jcrespo) @Cmjohnson I have restarted the service on this host, I will put it back down when you ping us you are available, as it cannot be down for long periods of time CC @Marostegui [08:10:34] (03CR) 10Banyek: [C: 032] db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460484 (owner: 10Banyek) [08:11:00] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [08:11:49] (03Merged) 10jenkins-bot: db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460484 (owner: 10Banyek) [08:15:31] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s) [08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:39] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [08:20:51] !log stopping and restarting db1069 for upgrade (x1 eqiad master) [08:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:32] (03CR) 
10jenkins-bot: db-codfw.php: Increase weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460484 (owner: 10Banyek) [08:25:18] (03PS4) 10Marostegui: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [08:27:27] !log reboot kafka100[1-3] (eventbus eqiad) for kernel upgrades [08:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:45] !log rebooting acamar for kernel tests [08:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:03] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@32a81be]: (no justification provided) [08:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:09] (03CR) 10Mathew.onipe: Elasticsearch module is coming up. (0334 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [08:41:28] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.581 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [08:41:59] (03PS1) 10Jcrespo: mariadb: Prepare for db1070 (s5 eqiad master) reimage [puppet] - 10https://gerrit.wikimedia.org/r/460485 [08:44:14] (03PS2) 10Jcrespo: mariadb: Prepare for db1070 (s5 eqiad master) reimage [puppet] - 10https://gerrit.wikimedia.org/r/460485 [08:45:52] !log Deploy schema change on s7 eqiad master (db1062) - T89737 [08:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:00] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:47:08] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [08:54:28] (03CR) 10Ema: [C: 032] site: convert cache::misc hosts to spares [puppet] - 
10https://gerrit.wikimedia.org/r/460217 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [08:54:35] (03PS2) 10Ema: site: convert cache::misc hosts to spares [puppet] - 10https://gerrit.wikimedia.org/r/460217 (https://phabricator.wikimedia.org/T164609) [08:59:15] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [09:01:09] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 134.4 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [09:02:12] there was a temp spike from the graphs [09:03:56] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp3007.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-r... [09:04:05] that maybe the noise created due to db maintenance [09:05:18] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:05:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:07:37] possibly yeah, spike was in gelf input [09:07:55] also these I bet can be silenced, for eqiad that is [09:08:22] yes, it looks like the alerts were triggered by this useless spike https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=eqiad&var-cache_type=upload&var-status_type=5&from=now-1h&to=now [09:08:38] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within 
thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:08:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:10:27] ack, I've silenced them until oct 8th [09:10:32] thanks! [09:11:17] (03PS3) 10Jcrespo: mariadb: Prepare for db1070 (s5 eqiad master) reimage [puppet] - 10https://gerrit.wikimedia.org/r/460485 [09:12:46] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare for db1070 (s5 eqiad master) reimage [puppet] - 10https://gerrit.wikimedia.org/r/460485 (owner: 10Jcrespo) [09:14:14] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) I've set up a test rig for rsyslog + omkafka on wmcs `logging-jessie01` and `logging-stretch01` . With the former being the producer and... [09:24:48] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [09:24:49] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.3784 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [09:27:05] \o hi opsen / sre :) [09:27:28] * addshore wants to ban requests to a wikidata API module for a single IP / UA, are there docs for that? [09:28:13] addshore: ema might know. He is a varnish babysitter [09:29:12] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [09:31:32] hashar: thanks [09:32:09] maybe I could do it in mw-config? 
[09:32:16] I remember someone doing this a week or so ago *looks in SAL* [09:34:39] (03CR) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:39:54] 10Operations, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, 10Wikidata, and 2 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) Tagging operations as we want to block these requests [09:39:58] hashar: ^^ fyi thats the ticket [09:41:26] * addshore tries to read some docs [09:41:36] (03PS1) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460491 [09:41:56] (03CR) 10jerkins-bot: [V: 04-1] Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460491 (owner: 10Petar.petkovic) [09:45:52] hashar: how should one discover which ops are currently awake to do the varnish block? :P [09:46:06] (03PS1) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 [09:47:25] (03PS2) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 [09:47:45] !log operation stopping db1070 in preparation for reimage [09:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:54] (03Abandoned) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460491 (owner: 10Petar.petkovic) [09:53:30] wait... is that Ip internal.... [09:59:08] elukey: any ideas to move this forward? 
:) [10:02:38] addshore: I'd ask the traffic team, ema might be around [10:02:59] it is a bit weird that webrequest [10:03:08] 10Operations, 10DBA, 10Epic: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) p:05Triage>03Normal [10:05:13] 10Operations, 10DBA, 10Epic: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Marostegui) [10:05:32] 10Operations, 10DBA, 10Epic: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by banyek on neodymium.eqiad.wmnet for hosts: ``` ['db1070.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf... [10:06:01] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Marostegui) [10:06:24] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Marostegui) [10:08:50] elukey: I can't tell where it is coming from, but my brain isn't fully awake yet [10:09:09] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Marostegui) [10:11:14] it's internal, right? [10:13:23] or not... [10:16:33] tools-worker-1021.tools.eqiad.wmflabs ! 
[10:16:49] 10Operations, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, 10Wikidata, and 2 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) Looks like it is tools-worker-1021.tools.eqiad.wmflabs [10:17:10] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) [10:23:49] addshore: hey [10:24:03] looking [10:24:03] addshore: yeah I was about to say that, maybe it comes from labs [10:24:08] hello ema :) [10:24:33] yup, it's coming from labs, trying to deal with it from the tools side now and get the process stopped [10:24:41] if I figure out what process it is... [10:30:06] addshore: so do we want to just stop all requests from that IP? [10:30:25] -cloud have just responded so will see if we can kill the tool instead :) [10:30:45] alright [10:36:25] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp1045.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-r... [10:37:27] (03CR) 10Mathew.onipe: Elasticsearch module is coming up. (0318 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:39:10] (03PS25) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [10:49:29] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10MoritzMuehlenhoff) As I had made a backport of the megaraid_sas driver for Perc 740/840 to the 4.9 stretch kernel anyway, I ran so... 
[10:50:01] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1045_v4, cp1045_v6 [10:52:51] ema: looks like we can't find which tool it is [10:53:01] so lets go ahead and block the IP I guess [10:53:34] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10aborrero) Ping @Pintoch , this seems to be your tool. [10:54:07] ema: if possible a block by UA & IP would be great so we don't break any other things running there [10:54:19] * addshore isn't sure what is possible easily in varnish, but I imagine most things are :P [10:54:40] 10Operations, 10MediaWiki-API, 10Availability, 10HHVM, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192 (10Liuxinyu970226) [10:57:30] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Pintoch) @aborrero thanks for the ping. I do not recognize the shape of the queries as coming from this tool though. The openrefi... [10:58:36] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1070.eqiad.wmnet'] ``` and were **ALL** successful. [10:59:29] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1045.eqiad.wmnet'] ``` and were **ALL** successful. 
[10:59:38] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) [11:05:07] (03PS6) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [11:05:09] (03PS2) 10Jcrespo: mariadb: Reenable db1062 notifications after reimage [puppet] - 10https://gerrit.wikimedia.org/r/460480 [11:05:11] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1070 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/460506 [11:05:54] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10aborrero) >>! In T204267#4583629, @Pintoch wrote: > @aborrero thanks for the ping. I do not recognize the shape of the queries as... [11:08:46] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db1070 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/460506 (owner: 10Jcrespo) [11:08:56] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1070 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/460506 [11:14:43] (03PS1) 10Banyek: db-codfw.php: Increase weight for db2068 (200) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 [11:24:09] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10aborrero) We stopped now the `corhist` tool which belongs to @Tpt, please check the tool. Now that the tool has been terminated,... 
[11:24:26] ema: eventually found the tool, no block needed [11:31:36] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) p:05Unbreak!>03High The API request rate has returned to a normal level: {F25850152} As have the SPARQL error cod... [11:33:48] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) a:05Addshore>03Smalyshev [11:37:12] addshore: nice! [11:54:58] PROBLEM - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 5 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [12:07:33] someone with proper rights please ban the spammer in #mediawiki [12:09:58] (03CR) 10Gehel: Elasticsearch module is coming up. (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:25:39] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1977 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:26:49] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [12:28:16] (03CR) 10Mobrovac: "Hm, I perhaps liked the "x" version more since it makes it clearer that we are ignoring the patch version." 
[puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [12:30:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] Mysql client not available on mwdebug* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [12:40:19] (03CR) 10Marostegui: [C: 04-1] db-codfw.php: Increase weight for db2068 (200) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:42:33] (03PS2) 10Banyek: db-codfw.php: Increase weight for db2068 (200) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 [12:43:13] (03CR) 10Marostegui: db-codfw.php: Increase weight for db2068 (200) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:43:37] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Increase weight for db2068 (200) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:44:08] (03PS3) 10Banyek: db-codfw.php: Set normal weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 [12:45:27] (03CR) 10Marostegui: [C: 031] db-codfw.php: Set normal weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:46:53] (03CR) 10Banyek: [C: 032] db-codfw.php: Set normal weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:48:21] (03Merged) 10jenkins-bot: db-codfw.php: Set normal weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:49:21] (03CR) 10jenkins-bot: db-codfw.php: Set normal weight for db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460508 (owner: 10Banyek) [12:51:43] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s) [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:51] T204127: Reclone db2054 and 
db2068 - https://phabricator.wikimedia.org/T204127 [12:53:44] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Banyek) [12:54:48] (03CR) 10Jcrespo: [C: 04-1] "mysql is deprecated, I think there is a mariadb::client one" [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [12:58:07] (03CR) 10Jcrespo: [C: 04-1] "There is a ::profile::mariadb::client to be used on a role, and there is the module mariadb::packages_client, depending on what wants to b" [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [12:59:28] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1051.eqiad.wmnet', 'cp2006.codfw.wmnet', 'cp2012.codfw.wmnet']... [13:03:05] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) a:03jcrespo [13:15:19] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [13:21:45] ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating Bstorm Working on the new user space tools for NFS and how it works with this. [13:24:00] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1051.eqiad.wmnet', 'cp2006.codfw.wmnet', 'cp2012.codfw.wmnet'] ``` and were **ALL** successful. 
[13:24:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [13:29:22] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [13:31:01] (03PS1) 10Andrew Bogott: Horizon: move 'phabricator' and 'git' projects to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/460531 [13:32:11] (03CR) 10Andrew Bogott: [C: 032] Horizon: move 'phabricator' and 'git' projects to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/460531 (owner: 10Andrew Bogott) [13:32:28] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [13:32:52] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) [13:32:54] 10Operations, 10ops-eqiad: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 (10jcrespo) [13:32:56] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10jcrespo) [13:35:37] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [13:36:38] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [13:37:41] 10Operations, 10ops-eqiad: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 (10jcrespo) I will shutdown this server fully at 14:11 UTC, this is a self reminder. 
[13:38:18] (03PS2) 10Ema: lvs: remove misc_web and misc_web-https [puppet] - 10https://gerrit.wikimedia.org/r/460218 (https://phabricator.wikimedia.org/T164609) [13:39:41] (03CR) 10Ema: [C: 032] lvs: remove misc_web and misc_web-https [puppet] - 10https://gerrit.wikimedia.org/r/460218 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [13:58:14] !log stop and restart db1118 for upgrade [13:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:25] PROBLEM - https://phabricator.wikimedia.org on phab1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:49] phab works for me [14:10:58] twentyafterfour ^^ [14:11:00] yea, works for me [14:11:08] marostegui it could be the case of the thread leak again [14:11:12] taking a look too [14:11:16] same here [14:11:40] https://phabricator.wikimedia.org/T182832 [14:11:43] !log shutting down db1062 for dc maintenance T204302 [14:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:51] T204302: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 [14:12:39] einsteinium (icinga) can also reach phab1001 [14:12:44] $USER1$/check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' [14:12:53] I think that misc-web-lb in the check is the issue [14:13:01] misc-web is no more [14:13:04] oh, doh [14:13:05] oh, switched to main cluster [14:13:15] I 'll fix it [14:13:18] thanks akosiaris [14:13:20] thx [14:13:26] and sorry for the noise [14:13:46] hah! thanks akosiaris [14:13:51] :) [14:14:03] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Tpt) Sorry everyone for the troubles. 
I was experimenting with a tool that tries to find corrections for constraint violations. I... [14:15:06] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn phab is ok. service check needs fix due to misc-web moving [14:15:12] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn phab is ok. service check needs fix due to misc-web moving [14:15:58] looks like phab was the only instance of misc-web in icinga [14:17:23] yea, besides comments in openstack talking about being "behind misc-web" [14:18:10] there's lots of references to cache_misc still, all to be removed next week: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460219/ [14:18:26] the "on phab1001" part of those alerts is super confusing anyways [14:18:50] (since it wasn't phab1001 at fault, and that's not what's actually being tested) [14:20:04] yea, fair enough. probably the check existed before phab was even behind caching layer and then was adjusted [14:20:20] (03PS1) 10Alexandros Kosiaris: Amend the phabricator icinga check [puppet] - 10https://gerrit.wikimedia.org/r/460540 [14:20:24] and kind of logically still belonged to it [14:21:04] yeah this check probably belongs under some other host [14:21:36] but which one.. could it be a virtual one [14:22:34] probably phabricator.wikimedia.org ? [14:22:45] as in, create a new one ? 
[14:22:56] tbh it should be under no host but icinga does not allow that [14:23:03] cause it's a software from the early 90s [14:23:15] (03PS1) 10Andrew Bogott: region-migrate: just use the same bastion for all ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/460541 [14:23:18] !log lvs1002: restart pybal to remove misc-web T164609 [14:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:25] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [14:23:26] yea, creating a new one is easy enough [14:25:46] that being said, doesn't make it super clear either what is actually being tested [14:28:44] !log lvs2002: restart pybal to remove misc-web T164609 [14:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:51] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [14:32:37] (03CR) 10Alexandros Kosiaris: [C: 031] tor_relay: add Bacula backups of Tor keys [puppet] - 10https://gerrit.wikimedia.org/r/460437 (owner: 10Dzahn) [14:33:49] (03PS2) 10Ema: Amend the phabricator icinga check [puppet] - 10https://gerrit.wikimedia.org/r/460540 (https://phabricator.wikimedia.org/T164609) (owner: 10Alexandros Kosiaris) [14:34:06] (03CR) 10Ema: [C: 031] Amend the phabricator icinga check [puppet] - 10https://gerrit.wikimedia.org/r/460540 (https://phabricator.wikimedia.org/T164609) (owner: 10Alexandros Kosiaris) [14:34:08] (03CR) 10Dzahn: [C: 031] "./check_http -S -H phabricator.wikimedia.org phabricator.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/460540 (https://phabricator.wikimedia.org/T164609) (owner: 10Alexandros Kosiaris) [14:36:55] (03PS38) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:36:57] (03PS4) 10Alex Monk: [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 
[14:42:14] !log lvs3002: restart pybal to remove misc-web T164609 [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [14:45:15] (03PS1) 10Dzahn: icinga/phabricator: move service check to a virtual host [puppet] - 10https://gerrit.wikimedia.org/r/460545 [14:46:02] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3008.esams.wmnet', 'cp2018.codfw.wmnet', 'cp1058.eqiad.wmnet']... [14:49:11] (03CR) 10BryanDavis: [C: 031] "syntax looks correct to me" [puppet] - 10https://gerrit.wikimedia.org/r/460467 (https://phabricator.wikimedia.org/T193848) (owner: 10Krinkle) [14:49:24] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:49:55] (03PS2) 10Andrew Bogott: region-migrate: just use the same bastion for all ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/460541 [14:50:23] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 79847 bytes in 0.213 second response time [14:52:11] 10Operations, 10Phabricator, 10Traffic: Allow traffic team to manage the traffic blog on phame - https://phabricator.wikimedia.org/T204355 (10ema) [14:52:24] 10Operations, 10Phabricator, 10Traffic: Allow traffic team to manage the traffic blog on phame - https://phabricator.wikimedia.org/T204355 (10ema) p:05Triage>03Normal [14:53:48] (03CR) 10Andrew Bogott: [C: 032] region-migrate: just use the same bastion for all ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/460541 (owner: 10Andrew Bogott) [14:53:54] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Jonas) @Tpt is it 
necessary for your tool to run the constraint checks in parallel? Using WDQS instead would be a good idea, beca... [14:56:09] 10Operations, 10Phabricator, 10Traffic: Allow traffic team to manage the traffic blog on phame - https://phabricator.wikimedia.org/T204355 (10Dzahn) I hit edit on https://phabricator.wikimedia.org/phame/blog/edit/11/ and from there clicked "Custom Policy" -> "Advanced"->"Custom Policy" next to "Editable by"... [15:00:21] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3539 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:03:21] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [15:04:32] looks like a brief input rate spike [15:13:56] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) Trying again with a minimal configuration below that should retry and save failed messages. ``` module(load="impstats" interval="1... 
[15:15:09] (03CR) 10Faidon Liambotis: [C: 04-1] Add SNMP classes (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [15:16:12] yeah I'm taking a look too, looks like the udp input [15:16:17] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.4976 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:16:41] (03CR) 10Faidon Liambotis: [C: 04-1] Add SNMP classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [15:16:46] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1927 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:17:08] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1058.eqiad.wmnet', 'cp2018.codfw.wmnet', 'cp3008.esams.wmnet'] ``` and were **ALL** successful. [15:18:39] looks like wdqs' stacktraces [15:18:54] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) >>! In T201409#4554407, @Krinkle wrote: > @mobrovac I think as a first step we should: > > * Standardise the name of... [15:19:01] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1061.eqiad.wmnet', 'cp2025.codfw.wmnet', 'cp3010.esams.wmnet']... [15:19:37] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4361 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:20:03] godog: huge java stack traces strike again? 
:P [15:20:35] heheh yes [15:20:44] the stack trace massacre of 2018 [15:20:54] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Cmjohnson) [15:21:12] godog: looking, we might be able to trim those a bit, or to rate limit them [15:22:33] 10Operations, 10ops-eqiad: db1062 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204302 (10Marostegui) Confirmed from my end too! ``` /admin1-> help [Usage] show [] [] [] [== ] set [... [15:23:50] godog: by large stack traces, did you mean the ones about "Could not create IV" ? [15:24:05] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Tpt) @Jonas Thank you for your feedback. > is it necessary for your tool to run the constraint checks in parallel? No, I am goi... [15:24:54] gehel: not particularly large but yeah looks like there's a bunch of warnings for those and then errors too, for something that looks related? i.e. date unparsable [15:25:07] yeah, probably a bot [15:25:53] (03CR) 10Ema: "noop on cache_text hosts: https://puppet-compiler.wmflabs.org/compiler1002/12461/" [puppet] - 10https://gerrit.wikimedia.org/r/460219 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [15:27:02] *nod* [15:34:30] gehel: anything we can do short term for that? [15:35:10] the "Could not create IV" are in a specific enough logger that we could drop them [15:36:30] godog: but that's interesting information, I would prefer not to drop :/ [15:36:37] random paranoia interjection: "IV" is usually a crypto term meaning "initialization vector". Maybe if something is spamming crypto-related stacktraces, they may be worth looking at. 
[15:39:01] (03CR) 10Imarlier: [C: 031] mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [15:40:02] (03PS2) 10EBernhardson: Mysql client not available on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/460451 [15:40:15] gehel: heh, am I getting it right that there's an ERROR for "invalid date" and then some (2/3?) WARNING for the same thing too? anyways yeah short term would be nice to drop or rate limit the warnings at least [15:40:19] bblack: nope, in this case it is internal blazegraph jargon, not crypto [15:40:28] ok :) [15:41:08] would dropping in logstash make an improvement since the input is being overloaded? [15:41:56] probably already too late on the logstash side [15:42:08] * gehel is digging into rate limiting per logger in logback [15:42:33] (03CR) 10Jcrespo: "This is how I would do it, but I would like to check nothing breaks with this, can you wait until monday to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [15:43:11] yeah likely already too late once logstash got the packet [15:44:08] (03CR) 10Jcrespo: Mysql client not available on mwdebug* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [15:45:27] we could rate limit on iptables though in this case [15:46:06] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:46:22] this is a good exercise and something to think about for future improvements — from an incident response standpoint how to quickly rate limit or drop before kafka/logstash [15:46:26] that’s true, just a udp threshold? 
[15:46:30] on that port [15:46:52] (03PS3) 10EBernhardson: Install mysql client to mediawiki canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/460451 [15:47:17] (03CR) 10EBernhardson: "sure there is no rush on this. I was looking into something yesterday and the command errored so thought i would try and fix it for the fu" [puppet] - 10https://gerrit.wikimedia.org/r/460451 (owner: 10EBernhardson) [15:47:30] yeah sth like that [15:47:42] (03PS1) 10Gehel: wdqs: decrease logging of a few loggers which overload logstash [puppet] - 10https://gerrit.wikimedia.org/r/460550 [15:47:48] godog: ^ [15:48:20] (03CR) 10Filippo Giunchedi: [C: 031] wdqs: decrease logging of a few loggers which overload logstash [puppet] - 10https://gerrit.wikimedia.org/r/460550 (owner: 10Gehel) [15:48:24] gehel: thanks! [15:48:34] there's no good rate limiting out of the box, it would require writing one, and that's not something I want to deploy on a Friday :( [15:48:43] (03CR) 10Gehel: [C: 032] wdqs: decrease logging of a few loggers which overload logstash [puppet] - 10https://gerrit.wikimedia.org/r/460550 (owner: 10Gehel) [15:49:03] yeah for real [15:49:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:50:09] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1061.eqiad.wmnet', 'cp2025.codfw.wmnet', 'cp3010.esams.wmnet'] ``` and were **ALL** successful. 
[15:50:37] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [15:56:15] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) icinga still shows battery recharging....let's give it the weekend [15:56:46] (03PS3) 10Dzahn: Amend the phabricator icinga check [puppet] - 10https://gerrit.wikimedia.org/r/460540 (https://phabricator.wikimedia.org/T164609) (owner: 10Alexandros Kosiaris) [15:57:07] !log set thread_pool_size to 64 at db2047 [15:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:31] (03CR) 10Dzahn: [C: 032] "i had put it into 2h downtime, fixing it before that is timing out" [puppet] - 10https://gerrit.wikimedia.org/r/460540 (https://phabricator.wikimedia.org/T164609) (owner: 10Alexandros Kosiaris) [15:59:03] (03PS2) 10Dzahn: icinga/phabricator: move service check to a virtual host [puppet] - 10https://gerrit.wikimedia.org/r/460545 [16:00:15] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 35497 bytes in 0.411 second response time [16:00:23] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [16:01:05] oh well, i tried to avoid that extra sms .. but we see it's fixed [16:02:24] 10Operations, 10DBA: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) db1090 took a long time to recover replication lag after db1062 initial maintenance, even more than dbstore1002- need to check why next week. [16:04:29] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) All bans are temporary, so as soon as traffic returns to normal the bans will expire. 
It would be nice if there was a... [16:06:45] (03PS3) 10Dzahn: icinga/phabricator: move service check to a virtual host [puppet] - 10https://gerrit.wikimedia.org/r/460545 [16:07:14] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10kchapman) TechCom is putting this on last call ending on 26 September 2pm PST(21:00 UTC, 23:00 CET) [16:09:01] (03CR) 10Dzahn: [C: 032] icinga/phabricator: move service check to a virtual host [puppet] - 10https://gerrit.wikimedia.org/r/460545 (owner: 10Dzahn) [16:09:45] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) a:05Papaul>03Jgreen @Jgreen it is all yours. Ping me when you are ready to do the install so you can show me how you do it. Wi... [16:10:05] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [16:11:00] gehel: looks like things are recovering, thanks again for your help! [16:11:15] godog: sorry for the overload! 
[16:11:45] RECOVERY - https://phabricator.wikimedia.org on phab1002 is OK: HTTP OK: HTTP/1.1 200 OK - 35497 bytes in 0.445 second response time [16:11:51] no worries at all [16:11:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Wikimedia-Logstash: Rate limit wdqs logs - https://phabricator.wikimedia.org/T204364 (10Gehel) [16:11:55] godog: followup in T204364 [16:11:56] T204364: Rate limit wdqs logs - https://phabricator.wikimedia.org/T204364 [16:13:57] there is one more icinga check affected by misc-web being gone [16:14:02] cache_misc: Varnishkafka Webrequest Delivery Errors per second [16:14:28] SMalyshev: ^^^ [16:18:36] (03CR) 10Smalyshev: wdqs: decrease logging of a few loggers which overload logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460550 (owner: 10Gehel) [16:25:00] (03CR) 10Ppchelko: "Node semver lib doesn't understand `x`" [puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [16:26:55] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03592 https://grafana.wikimedia.org/dashboard/db/logstash [16:27:27] 10Operations, 10Mail, 10Patch-For-Review, 10User-herron: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 (10herron) 05Open>03Resolved mx1001 has been stable for 24 hours. In Grafana deferrals on mx1001 do appear to be trending upwards (https://grafana.wikimedia.org/dashboa... 
[16:28:13] (03CR) 10Gehel: wdqs: decrease logging of a few loggers which overload logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460550 (owner: 10Gehel) [16:29:26] (03PS2) 10Dzahn: tor_relay: add Bacula backups of Tor keys [puppet] - 10https://gerrit.wikimedia.org/r/460437 [16:29:35] (03PS2) 10Thcipriani: Add tox.ini [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 [16:29:56] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [16:30:02] ^ me.. and on it [16:30:21] i wam waiting for it to create a new virtual host hopefully now [16:30:21] (03CR) 10Thcipriani: "> Sorry actually there is a better pattern I have just remembered" [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [16:35:22] (03PS1) 10Dzahn: varnishkafka/icinga: remove check for misc-web webrequests [puppet] - 10https://gerrit.wikimedia.org/r/460562 [16:36:45] (03PS2) 10Dzahn: varnishkafka/icinga: remove check for misc-web webrequests [puppet] - 10https://gerrit.wikimedia.org/r/460562 (https://phabricator.wikimedia.org/T164609) [16:38:01] (03CR) 1020after4: [C: 031] Add tox.ini [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [16:44:51] 10Operations, 10Performance-Team, 10Traffic: Stop oversampling Asian countries - https://phabricator.wikimedia.org/T204365 (10Imarlier) [16:45:07] (03PS1) 10Imarlier: wmf-config: remove oversampling for Asian countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460566 (https://phabricator.wikimedia.org/T204365) [16:48:15] 10Operations, 10Phabricator, 10Traffic: Allow traffic team to manage the traffic blog on phame - https://phabricator.wikimedia.org/T204355 (10greg) {F25864089} Ya'll should be able to do that now (ftr: I didn't make any additions, that's what it looked like after mutante's edits). The "custom policy" thing... 
[16:55:08] (03PS1) 10Dzahn: icinga/phabricator: don't declare monitoring host as virtual resource [puppet] - 10https://gerrit.wikimedia.org/r/460569 [16:56:26] (03CR) 10Krinkle: [C: 031] wmf-config: remove oversampling for Asian countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460566 (https://phabricator.wikimedia.org/T204365) (owner: 10Imarlier) [16:59:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 58.97 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:01:24] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 71.46 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [17:02:21] (03CR) 10Dzahn: [C: 032] "the icinga host doesn't get realized like this unless the puppet class it is in is applied on the icinga hosts themselves" [puppet] - 10https://gerrit.wikimedia.org/r/460569 (owner: 10Dzahn) [17:04:12] (03PS3) 10Dzahn: tor_relay: add Bacula backups of Tor keys [puppet] - 10https://gerrit.wikimedia.org/r/460437 [17:05:16] (03CR) 10Dzahn: [C: 032] tor_relay: add Bacula backups of Tor keys [puppet] - 10https://gerrit.wikimedia.org/r/460437 (owner: 10Dzahn) [17:06:05] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) created a ticket with Dell You have successfully submitted request SR979751933. [17:07:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.17 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:08:42] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:09:23] that's the bacula service i just added, not the actual service [17:09:39] thx mutante was just looking [17:09:42] checking why that fails [17:09:49] i just wanted to add backups [17:10:00] bacula-fd.service [17:10:21] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:10:34] 10Operations, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10ssastry) >>! In T201366#4521540, @RobH wrote: > So the change to apply things the same as ruthnium won't work quite right, since we applied more than one role to that system. Sorr... [17:10:38] ^ yay, that was the other thing i was on :) [17:10:42] is tor related data something good to retain for longer via backups? probably not much logged there but just asking [17:10:46] icinga config due to the phab change [17:10:52] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [17:11:39] it's just about keeping the key files and fingerprint so if there was disk failure we'd be able to recreate the relay with the same fingerprints and keep the guard and stable flags [17:11:58] ah, makes sense [17:12:10] the same data that we rsynced to migrate to a new machine [17:12:53] see https://metrics.torproject.org/rs.html#search/wikimedia how there are 2 green ones with the extra flags.. and the red one [17:13:00] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Raise alert level for old elasticsearch servers - https://phabricator.wikimedia.org/T204361 (10Gehel) The current config for that check is in [[ https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/elasticsearch/cirrus.yam... [17:13:20] the red one is what happens if the fingerprint changes..it's treated like new [17:14:05] nice [17:14:07] i guess the alert just fixed itself after next puppet run.. 
didnt really do it [17:16:49] regarding the icinga config that recovered.. well mostly fine, i still have a duplicate definition but it's just a warning. can fix it later. [17:17:13] definitely no errors anymore [17:18:04] funny is that the config check first shows you warnings and then further down says: Total Warnings: 0 too.. [17:35:03] (03CR) 10Elukey: [C: 031] varnishkafka/icinga: remove check for misc-web webrequests [puppet] - 10https://gerrit.wikimedia.org/r/460562 (https://phabricator.wikimedia.org/T164609) (owner: 10Dzahn) [17:39:25] 10Operations, 10Mail, 10Patch-For-Review, 10User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (10herron) With the upgrade to stretch complete here is a snapshot of current mx1001 TLS ciphers, protocols, etc. (output from testssl) {P7550} In https://gerrit.wikimedia.org... [17:53:35] (03PS1) 10Ayounsi: Update SSH key for user ktsouroupidou [puppet] - 10https://gerrit.wikimedia.org/r/460576 (https://phabricator.wikimedia.org/T202486) [17:54:44] (03CR) 10Ayounsi: [C: 032] Update SSH key for user ktsouroupidou [puppet] - 10https://gerrit.wikimedia.org/r/460576 (https://phabricator.wikimedia.org/T202486) (owner: 10Ayounsi) [17:56:34] what's the last week before deployment freeze eoy this year? [17:57:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10ayounsi) Update, @Kalliope let us know if you're all set. [18:04:36] (03PS3) 10Dzahn: varnishkafka/icinga: remove check for misc-web webrequests [puppet] - 10https://gerrit.wikimedia.org/r/460562 (https://phabricator.wikimedia.org/T164609) [18:06:07] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[18:06:51] ^ XioNoX [18:08:19] er [18:08:56] I was on the merge yes/no screen and got distracted [18:08:56] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:09:02] all good now, thx [18:09:17] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [18:09:56] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:12:52] (03PS26) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [18:13:58] (03CR) 10Dzahn: [C: 032] varnishkafka/icinga: remove check for misc-web webrequests [puppet] - 10https://gerrit.wikimedia.org/r/460562 (https://phabricator.wikimedia.org/T164609) (owner: 10Dzahn) [18:17:28] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (10Krinkle) Messages that are too long or got truncated appear to be reported in Logstash now, under `channel:jsonTruncated`. The only meta-data indexed for these is information from s... 
[18:19:39] (03PS1) 10Paladox: Add support for neutron ip range in puppet master standalone [puppet] - 10https://gerrit.wikimedia.org/r/460579 [18:20:59] (03PS2) 10Paladox: Add support for neutron ip range in puppet master standalone [puppet] - 10https://gerrit.wikimedia.org/r/460579 [18:25:26] (03PS1) 10Dzahn: icinga/phabricator: only monitor https on a single (virtual) host [puppet] - 10https://gerrit.wikimedia.org/r/460580 [18:29:14] (03CR) 10Andrew Bogott: [C: 032] Add support for neutron ip range in puppet master standalone [puppet] - 10https://gerrit.wikimedia.org/r/460579 (owner: 10Paladox) [18:29:19] (03CR) 10Andrew Bogott: [C: 032] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/460579 (owner: 10Paladox) [18:30:08] (03PS2) 10Dzahn: icinga/phabricator: only monitor https on a single (virtual) host [puppet] - 10https://gerrit.wikimedia.org/r/460580 [18:30:53] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (10EBernhardson) Indeed the original problem here was that two log lines were merged into one. I haven't seen this problem with the logstash infrastructure, but that doesn't mean it doe... 
[18:31:33] (CR) Dzahn: [C: 2] icinga/phabricator: only monitor https on a single (virtual) host [puppet] - https://gerrit.wikimedia.org/r/460580 (owner: Dzahn) [18:34:32] (CR) Dzahn: "maybe we need to make a difference between a role for both types of testing, the one where we want notifications and the one where we don'" [puppet] - https://gerrit.wikimedia.org/r/460064 (owner: Dzahn) [18:44:42] (PS3) Dzahn: icinga: add notes_url parameter to NRPE monitor service [puppet] - https://gerrit.wikimedia.org/r/459641 (https://phabricator.wikimedia.org/T197873) [18:51:54] (CR) Dzahn: [C: 2] icinga: add notes_url parameter to NRPE monitor service [puppet] - https://gerrit.wikimedia.org/r/459641 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [18:59:04] (PS1) Dzahn: openstack/labs-puppet-enc: accept "eqiad-r" as a valid realm [puppet] - https://gerrit.wikimedia.org/r/460587 [18:59:09] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:00:40] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={create_container,run_podsandbox,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:00:44] (CR) Dzahn: [C: 2] "icinga -v config check is now happy, no more warnings, no more errors" [puppet] - https://gerrit.wikimedia.org/r/460580 (owner: Dzahn) [19:01:09] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:02:48] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:08:39] (PS1) Andrew Bogott: designate sink: Refrain from cleaning up migrating VMs [puppet] - https://gerrit.wikimedia.org/r/460589 (https://phabricator.wikimedia.org/T167293) [19:22:23] Operations, Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) @ssastry The issue Rob is mentioning is that there are direct includes in site.pp which are a violation of puppet lint/style checks: ``` 00:35:26 wmf-style: total violatio... [19:24:05] (PS2) Andrew Bogott: designate sink: Refrain from cleaning up migrating VMs [puppet] - https://gerrit.wikimedia.org/r/460589 (https://phabricator.wikimedia.org/T167293) [19:37:24] Operations, Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) The best fix would be to rename/convert role::parsoid::rt_server, role::parsoid::vd_server", :role::parsoid::rt_client, ::role::parsoid::vd_client and ::role::parsoid::diffs... [19:39:07] (PS3) Andrew Bogott: designate sink: Refrain from cleaning up migrating VMs [puppet] - https://gerrit.wikimedia.org/r/460589 (https://phabricator.wikimedia.org/T167293) [19:40:42] (Abandoned) Dzahn: openstack/labs-puppet-enc: accept "eqiad-r" as a valid realm [puppet] - https://gerrit.wikimedia.org/r/460587 (owner: Dzahn) [19:43:29] PROBLEM - Restbase root url on restbase2003 is CRITICAL: HTTP CRITICAL - No data received from host [19:44:38] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 16052 bytes in 0.122 second response time [19:56:42] Do non-security patches get backported upstream into Debian Stretch? (If the maintainer pushes it, obviously, but is it within policy?) If yes, what happens to the WMF cluster – does puppet auto-upgrade them? Does an SRE have to do it manually?
[19:56:47] (PS4) Andrew Bogott: designate sink: Refrain from cleaning up migrating VMs [puppet] - https://gerrit.wikimedia.org/r/460589 (https://phabricator.wikimedia.org/T167293) [19:56:59] (CR) Andrew Bogott: [C: 2] designate sink: Refrain from cleaning up migrating VMs [puppet] - https://gerrit.wikimedia.org/r/460589 (https://phabricator.wikimedia.org/T167293) (owner: Andrew Bogott) [20:03:49] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (CRoslof) @tramm We can retain ownership of the domain name but change the nameservers to ones you control. What nameservers should we use? [20:15:24] (PS1) Andrew Bogott: WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 [20:17:16] (PS8) Ayounsi: Add SNMP classes [puppet] - https://gerrit.wikimedia.org/r/364753 (owner: Faidon Liambotis) [20:17:47] (CR) Ayounsi: "Thanks for your time! Latest PS addresses your comments."
(7 comments) [puppet] - https://gerrit.wikimedia.org/r/364753 (owner: Faidon Liambotis) [20:20:45] (PS2) Andrew Bogott: WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 [20:27:44] (PS3) Andrew Bogott: WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 [20:28:22] (CR) jerkins-bot: [V: -1] WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 (owner: Andrew Bogott) [20:29:42] (PS4) Andrew Bogott: WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 [20:33:18] (PS5) Andrew Bogott: WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 [20:36:10] (CR) Andrew Bogott: [C: 2] WMCS puppetmasters: Use 'network::constants::labs_networks' for firewalls [puppet] - https://gerrit.wikimedia.org/r/460599 (owner: Andrew Bogott) [20:37:56] (PS1) Urbanecm: Enable Translate on idwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292) [20:50:15] Ꭺⅼlаh іs dഠіng [20:50:15] ѕuᥒ is not ԁoinɡ Aⅼlaһ іs dоing [20:53:29] (PS1) Dzahn: parsoid: role/profile refactoring [puppet] - https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) [20:54:12] (CR) jerkins-bot: [V: -1] parsoid: role/profile refactoring [puppet] - https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn) [20:57:41] (PS2) Dzahn: parsoid: role/profile refactoring [puppet] - https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) [21:01:00] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 5.001 ge 4 daniel_zahn
https://phabricator.wikimedia.org/T200678 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [21:03:15] !log ACKed memory error alert on wtp2011 - existing ticket but fresh alert popped up 9h ago (T200678) [21:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:22] T200678: wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 [21:05:19] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:19] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 79837 bytes in 0.168 second response time [21:16:24] !log andrew@deploy1001 Started deploy [horizon/deploy@56340cd]: Fix proxy creation in neutron regions [21:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:17] Operations, Scap, Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.6-1 - https://phabricator.wikimedia.org/T204383 (thcipriani) p:Triage>High [21:19:55] !log andrew@deploy1001 Finished deploy [horizon/deploy@56340cd]: Fix proxy creation in neutron regions (duration: 03m 31s) [21:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:20] (PS1) Thcipriani: Scap: upgrade to 3.8.6-1 [puppet] - https://gerrit.wikimedia.org/r/460610 (https://phabricator.wikimedia.org/T204383) [21:20:37] (PS2) Urbanecm: New throttle rule for enwiki event [mediawiki-config] - https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) [21:20:50] (CR) Urbanecm: "If you want me to...Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) (owner: Urbanecm) [21:21:17] Anybody to merge&deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/460432/2 please?
[21:23:39] (CR) Thcipriani: "> +1 but let's stall this for the next couple of days and merge on" [puppet] - https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: Thcipriani) [21:25:00] thcipriani, around to do a quick throttle deploy please? (^^^) [21:29:49] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Eevans) Here it is again after a day; This is definitely //something//, though not enough to be a game-changer (granted this is a pretty simplistic test). | {F25868691} | | `fstrim --all` @ 2018-09-13... [22:08:54] Operations, Performance-Team, Wikidata, Wikidata-Query-Service, User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Doesn't seem to be WDQS related entirely - e.g. if I call 'https://www.wikidata.org/w/api.php... [22:27:31] (CR) Gehel: [C: -1] Elasticsearch module is coming up. (12 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe) [22:32:30] Operations, Elasticsearch, Discovery-Search (Current work): Raise alert level on disk space for old elasticsearch servers - https://phabricator.wikimedia.org/T204361 (Gehel) [22:36:04] Aⅼlаһ is doіnɡ [22:36:04] sᥙᥒ is not dഠiᥒg Ꭺlⅼah іѕ ԁoinɡ [22:36:04] moοn iѕ not ⅾοiᥒɡ Allaһ ⅰs doinɡ [23:14:39] (CR) Faidon Liambotis: [C: 2] Add SNMP classes [puppet] - https://gerrit.wikimedia.org/r/364753 (owner: Faidon Liambotis) [23:18:52] Allаh is doing [23:18:52] sun iѕ not doіng Αⅼlaһ is doіnɡ [23:18:52] ⅿoοn is not doіng Ꭺllɑh is doing