[00:04:00] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/libs/rdbms/database: T201900 - I8ae754a2518 (duration: 00m 59s)
[00:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:11] T201900: PHP Notice: Trying to get property 'num_rows' of non-object in /home/travis/build/wikimedia/mediawiki/includes/libs/rdbms/database/DatabaseMysqli.php on line 233 on PHP 7.2 travis builds - https://phabricator.wikimedia.org/T201900
[00:57:49] Operations, Core Platform Team Kanban (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Smalyshev) > do you have an idea on how different 7.3 is from 7.2 Shouldn't be very different. There were a number of...
[01:18:59] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 98.4 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[01:40:53] Operations, netops, Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (faidon) This sounds a lot like T133387, which we reported a while back and had ATAC and engineering involved...
[02:09:30] PROBLEM - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[02:09:31] ACKNOWLEDGEMENT - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206500
[02:09:38] Operations, ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (ops-monitoring-bot)
[02:18:54] (PS2) Krinkle: profiler: Use wmfArcLampFlush() in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916)
[02:21:10] !log krinkle@deploy1001 Synchronized wmf-config/arclamp.php: T176916 - Id79baae90: ensure file exists before Ie86e88777c48 (duration: 00m 57s)
[02:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:15] T176916: Set up sampling profiler for PHP 7 (alternative to HHVM Xenon) - https://phabricator.wikimedia.org/T176916
[02:22:59] (CR) Krinkle: [C: 2] profiler: Use wmfArcLampFlush() in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[02:24:48] (Merged) jenkins-bot: profiler: Use wmfArcLampFlush() in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[02:33:30] (CR) jenkins-bot: profiler: Use wmfArcLampFlush() in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[02:41:05] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: T176916 / T206092 - Ie86e88777c48 (duration: 00m 56s)
[02:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:15] T176916: Set up sampling profiler for PHP 7 (alternative to HHVM Xenon) - https://phabricator.wikimedia.org/T176916
[02:41:16] T206092: profiler.php sometimes emits RedisException "read error on connection" during request shutdown - https://phabricator.wikimedia.org/T206092
[03:14:00] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.305 second response time
[03:29:10] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.572 second response time
[03:32:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 832.39 seconds
[04:18:50] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[04:19:20] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.18 seconds
[04:21:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[04:29:30] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[04:33:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[04:43:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.46 seconds
[04:45:37] (PS1) Ladsgroup: labs: Enable search integration with Article Placeholder back with API backend [mediawiki-config] - https://gerrit.wikimedia.org/r/465311 (https://phabricator.wikimedia.org/T195751)
[04:47:35] (CR) Ladsgroup: [C: 2] labs: Enable search integration with Article Placeholder back with API backend [mediawiki-config] - https://gerrit.wikimedia.org/r/465311 (https://phabricator.wikimedia.org/T195751) (owner: Ladsgroup)
[04:49:04] (Merged) jenkins-bot: labs: Enable search integration with Article Placeholder back with API backend [mediawiki-config] - https://gerrit.wikimedia.org/r/465311 (https://phabricator.wikimedia.org/T195751) (owner: Ladsgroup)
[04:49:45] rebased on deploy1001 ^
[05:02:45] (CR) jenkins-bot: labs: Enable search integration with Article Placeholder back with API backend [mediawiki-config] - https://gerrit.wikimedia.org/r/465311 (https://phabricator.wikimedia.org/T195751) (owner: Ladsgroup)
[05:04:03] Operations, ops-eqiad, DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (Marostegui) p:Triage>High a:Cmjohnson @Cmjohnson can we replace this as soon as possible? This is enwiki primary master
[05:07:58] (PS1) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312
[05:08:23] Amir1: Can I depool ^ or are you in the middle or something?
[05:08:39] marostegui: hey, it's fine to depool
[05:09:01] Amir1: great! thank you
[05:09:22] Thank you!
[05:09:23] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312 (owner: Marostegui)
[05:09:28] come on jenkins
[05:10:17] (PS2) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312
[05:11:03] LOL
[05:11:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312 (owner: Marostegui)
[05:13:23] (Merged) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312 (owner: Marostegui)
[05:14:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 (duration: 00m 59s)
[05:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:25] (CR) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/465312 (owner: Marostegui)
[05:19:14] !log Stop MySQL on db1122 for binlog format change, mysql and kernel upgrade
[05:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:00] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33768 MB (3% inode=99%)
[06:26:11] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/465319
[06:31:11] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/dhparam.pem]
[06:34:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/465319 (owner: Marostegui)
[06:36:06] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/465319 (owner: Marostegui)
[06:37:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 (duration: 00m 57s)
[06:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:29] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33657 MB (3% inode=99%)
[06:47:17] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/465319 (owner: Marostegui)
[06:52:18] (CR) Filippo Giunchedi: "> Patch Set 1: Code-Review+1" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: Filippo Giunchedi)
[06:56:30] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:50] RECOVERY - DPKG on contint1001 is OK: All packages OK
[07:03:05] !log restart zuul and zuul-merger on contint1001 for the upgrade of zuul to finish
[07:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:16] <_joe_> akosiaris: contint2001 as well ;)
[07:04:30] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:50] _joe_: we probably should do some more investigation on that one
[07:07:07] fix that bug if anything
[07:07:40] <_joe_> ack
[07:07:44] (CR) Alexandros Kosiaris: [C: -2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:08:01] (CR) jerkins-bot: [V: -1] scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:08:30] (CR) Jcrespo: [C: 1] Update db cumin aliases [puppet] - https://gerrit.wikimedia.org/r/465175 (owner: Muehlenhoff)
[07:10:04] (PS3) Muehlenhoff: Update db cumin aliases [puppet] - https://gerrit.wikimedia.org/r/465175
[07:11:38] (CR) Muehlenhoff: [C: 2] Update db cumin aliases [puppet] - https://gerrit.wikimedia.org/r/465175 (owner: Muehlenhoff)
[07:12:17] (PS3) Filippo Giunchedi: wmcs: add prometheus-memcached-exporter [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326)
[07:12:58] (CR) Filippo Giunchedi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi)
[07:22:40] (PS2) Muehlenhoff: Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454)
[07:23:04] (PS4) Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370)
[07:23:06] (PS2) Giuseppe Lavagetto: Purge the last references to jobqueue redis. [mediawiki-config] - https://gerrit.wikimedia.org/r/443420 (https://phabricator.wikimedia.org/T198220)
[07:23:44] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[07:23:58] (PS10) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968)
[07:29:57] (PS3) Muehlenhoff: Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454)
[07:29:59] PROBLEM - High lag on wdqs2001 is CRITICAL: 4125 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:34:44] (CR) Elukey: [C: 1] Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[07:36:30] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 348 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:39:18] (CR) Muehlenhoff: mediawiki::web::prod_sites: convert donate.w.o (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[07:41:59] PROBLEM - High lag on wdqs2001 is CRITICAL: 4588 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:43:00] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 353 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:44:20] ACKNOWLEDGEMENT - High lag on wdqs2001 is CRITICAL: 4636 ge 3600 Mathew.onipe High lag issue ack. - T206423 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:47:00] RECOVERY - Disk space on eventlog1002 is OK: DISK OK
[07:55:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:57:06] (PS2) Muehlenhoff: Remove Diamond from openldap/labs hosts [puppet] - https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454)
[07:58:39] (CR) Gehel: [C: 2] prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: Mathew.onipe)
[08:00:00] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:03:20] (PS5) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert donate.w.o [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968)
[08:04:50] PROBLEM - High lag on wdqs2001 is CRITICAL: 5596 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:05:25] (PS3) Elukey: Replace analytics1003's occurrences with an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/461997 (https://phabricator.wikimedia.org/T203635)
[08:06:00] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 728 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:06:13] (PS1) Muehlenhoff: Update/add Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/465325
[08:06:19] !log depooling wdqs2001 to catch up on lag -T206423
[08:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:24] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423
[08:07:46] (PS4) Elukey: Replace analytics1003's occurrences with an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/461997 (https://phabricator.wikimedia.org/T203635)
[08:11:07] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (Gehel) Looking at dropped packets, it looks like we did not have any over the last few days. So we have an...
[08:13:39] PROBLEM - High lag on wdqs2002 is CRITICAL: 3606 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:15:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:16:22] !log update puppet compiler facts
[08:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:18:11] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.176 second response time
[08:19:00] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 614 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:19:26] (CR) Muehlenhoff: [C: 1] "One nit, but looks good to me." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[08:20:41] (CR) Muehlenhoff: [C: 2] Update/add Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/465325 (owner: Muehlenhoff)
[08:23:20] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.168 second response time
[08:24:29] PROBLEM - High lag on wdqs2002 is CRITICAL: 4120 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:26:39] PROBLEM - High lag on wdqs2001 is CRITICAL: 5868 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:26:40] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 718 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:41:49] PROBLEM - High lag on wdqs2002 is CRITICAL: 4925 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:43:59] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 974 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:45:02] (CR) Gehel: wdqs: auto deployment of wdqs on wdqs1009 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: Mathew.onipe)
[08:45:04] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12825/" [puppet] - https://gerrit.wikimedia.org/r/461997 (https://phabricator.wikimedia.org/T203635) (owner: Elukey)
[08:45:48] (PS5) Elukey: Replace analytics1003's occurrences with an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/461997 (https://phabricator.wikimedia.org/T203635)
[08:46:31] (CR) Elukey: [C: 2] Replace analytics1003's occurrences with an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/461997 (https://phabricator.wikimedia.org/T203635) (owner: Elukey)
[08:51:40] PROBLEM - High lag on wdqs2002 is CRITICAL: 5356 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:56:18] (PS6) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert donate.w.o [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968)
[08:58:20] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:30] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 51 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:00:47] !log re-enable puppet/pybal on lvs1002, IPv6 connectivity with phab1001 working again T201039
[09:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:52] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039
[09:02:30] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy
[09:02:49] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[09:04:11] (CR) Muehlenhoff: [C: 1] mediawiki::web::prod_sites: convert donate.w.o [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[09:04:19] RECOVERY - PyBal connections to etcd on lvs1002 is OK: OK: 10 connections established with conf1004.eqiad.wmnet:4001 (min=10)
[09:04:43] (PS1) Filippo Giunchedi: Merge tag 'upstream/0.7.0' [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/465349
[09:04:45] (PS1) Filippo Giunchedi: debian: ship systemd service [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/465350
[09:04:47] (PS1) Filippo Giunchedi: debian: use standard rules for Prometheus packages [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/465351
[09:04:49] (PS1) Filippo Giunchedi: debian: update changelog [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/465352
[09:04:49] PROBLEM - High lag on wdqs2002 is CRITICAL: 6034 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:06:59] PROBLEM - High lag on wdqs2002 is CRITICAL: 6139 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:08:58] (CR) Muehlenhoff: debian: ship systemd service (1 comment) [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/465350 (owner: Filippo Giunchedi)
[09:11:19] PROBLEM - PyBal connections to etcd on lvs1005 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=10)
[09:13:20] PROBLEM - High lag on wdqs2002 is CRITICAL: 6498 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:14:02] _joe_: perhaps pybal needs to be restarted on lvs1005 to pick up the config changes? (see etcd alert above)
[09:15:27] (PS1) Elukey: role::analytics_cluster::coordinator: deploy jobs to an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/465361 (https://phabricator.wikimedia.org/T205509)
[09:15:34] I see that the puppet run replaced conf2001 with conf1004 in pybal's config, so probably yes a restart is needed
[09:15:47] <_joe_> ema: yes
[09:15:52] k
[09:15:54] <_joe_> it is needed indeed
[09:16:04] (CR) Elukey: [C: 2] role::analytics_cluster::coordinator: deploy jobs to an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/465361 (https://phabricator.wikimedia.org/T205509) (owner: Elukey)
[09:16:09] <_joe_> nothing bad is happening, but that's the last thing to port back locally
[09:16:24] perfect
[09:16:28] !log restart pybal on lvs1005 to pick up config changes (conf2001 -> conf1004)
[09:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:49] PROBLEM - High lag on wdqs2002 is CRITICAL: 6713 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:20:00] PROBLEM - High lag on wdqs2002 is CRITICAL: 6795 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:21:29] RECOVERY - PyBal connections to etcd on lvs1005 is OK: OK: 10 connections established with conf1004.eqiad.wmnet:4001 (min=10)
[09:23:49] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:25:16] !log swapped Hadoop's hive/oozie from analytics1003 to an-coord1001
[09:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:24] (PS5) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[09:26:26] (PS4) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[09:31:02] PROBLEM - High lag on wdqs2002 is CRITICAL: 7299 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:46:12] PROBLEM - High lag on wdqs2002 is CRITICAL: 8040 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:46:24] (PS3) Ema: varnishmedia: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942)
[09:48:21] PROBLEM - High lag on wdqs2002 is CRITICAL: 8134 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:49:04] (CR) Ema: [C: 2] "The only graph using stats produced by varnishmedia, "Thumbnails served from Varnish" on https://grafana.wikimedia.org/dashboard/db/media," [puppet] - https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[09:52:01] (PS1) Muehlenhoff: rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - https://gerrit.wikimedia.org/r/465378
[09:52:50] (CR) jerkins-bot: [V: -1] rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - https://gerrit.wikimedia.org/r/465378 (owner: Muehlenhoff)
[09:53:21] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishmedia]
[09:53:35] looking ^
[09:53:42] PROBLEM - puppet last run on cp4023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishmedia]
[09:54:21] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::prod_sites: convert donate.w.o [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[09:54:33] (PS7) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert donate.w.o [puppet] - https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968)
[09:55:27] (PS2) Muehlenhoff: rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - https://gerrit.wikimedia.org/r/465378
[09:56:33] (CR) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[09:56:48] (CR) Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/12828/deploy1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[09:58:51] RECOVERY - puppet last run on cp4023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:59:08] (PS1) Elukey: profile::hive::site_hdfs: remove other permissions [puppet] - https://gerrit.wikimedia.org/r/465381
[10:01:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:01:47] (CR) Joal: [C: 1] profile::hive::site_hdfs: remove other permissions [puppet] - https://gerrit.wikimedia.org/r/465381 (owner: Elukey)
[10:01:59] (CR) Elukey: [C: 2] profile::hive::site_hdfs: remove other permissions [puppet] - https://gerrit.wikimedia.org/r/465381 (owner: Elukey)
[10:02:06] (PS1) Elukey: role::analytics_cluster::coordinator: remove analytics1003 references [puppet] - https://gerrit.wikimedia.org/r/465382 (https://phabricator.wikimedia.org/T205509)
[10:02:24] (PS2) Elukey: role::analytics_cluster::coordinator: remove analytics1003 references [puppet] - https://gerrit.wikimedia.org/r/465382 (https://phabricator.wikimedia.org/T205509)
[10:02:51] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:03:03] checking --^
[10:03:16] (CR) Elukey: [C: 2] role::analytics_cluster::coordinator: remove analytics1003 references [puppet] - https://gerrit.wikimedia.org/r/465382 (https://phabricator.wikimedia.org/T205509) (owner: Elukey)
[10:03:30] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:03:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:04:29] (PS6) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[10:04:32] (PS5) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[10:05:43] (PS1) Ema: varnishmedia: post-removal cleanup [puppet] - https://gerrit.wikimedia.org/r/465383 (https://phabricator.wikimedia.org/T184942)
[10:14:28] Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (faidon) einstenium and tegmen still run jessie and I didn't build a version for jessie-wikimedia. I believe they're being migrated to stretch as we speak, so maybe we should just wait for that.
[10:15:07] (CR) Alexandros Kosiaris: "TBH, I am not sure we should be setting hosts_allow/hosts_deny in the first place in rsyncd. Sure the software supports that, but what's t" [puppet] - https://gerrit.wikimedia.org/r/465378 (owner: Muehlenhoff)
[10:17:06] (PS7) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[10:17:08] (PS6) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[10:24:35] (CR) Muehlenhoff: "We have plenty of existing rsyncds with hosts_allow configured, with most of them setting up a ferm service to that same set of hosts. Cre" [puppet] - https://gerrit.wikimedia.org/r/465378 (owner: Muehlenhoff)
[10:26:11] (PS1) Elukey: Set analytics1003 as spare host (prep step for decom) [puppet] - https://gerrit.wikimedia.org/r/465386 (https://phabricator.wikimedia.org/T205509)
[10:27:45] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12832/" [puppet] - https://gerrit.wikimedia.org/r/465386 (https://phabricator.wikimedia.org/T205509) (owner: Elukey)
[10:29:00] (PS2) Ema: varnishmedia: post-removal cleanup [puppet] - https://gerrit.wikimedia.org/r/465383 (https://phabricator.wikimedia.org/T184942)
[10:29:58] (CR) Ema: [C: 2] varnishmedia: post-removal cleanup [puppet] - https://gerrit.wikimedia.org/r/465383 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[10:34:11] !log about to perform live-test of the inverted switchdc (eqiad->codfw), actions will be real but basically noop due to codfw being already active - T203777
[10:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:18] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777
[10:35:21] akosiaris: I'm ready to start
[10:35:53] !log deploying prometheus-blazegraph-exporter 0.6 on all wdqs clusters - T206123
[10:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:58] T206123: Monitor query / request concurrency on Blazegraph - https://phabricator.wikimedia.org/T206123
[10:36:38] volans: go
[10:36:51] !log START - Cookbook sre.switchdc.mediawiki.00-disable-puppet (volans@neodymium)
[10:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:34] Operations, Puppet, Patch-For-Review, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (Banyek) this ticket is closed but more follow-up will appear in T206521
[10:37:49] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) (volans@neodymium)
[10:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:32] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (ema)
[10:39:59] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (ema)
[10:40:04] volans: anything wrong ? I see you paused
[10:40:14] I wrote to you in query :)
[10:40:19] proceeding
[10:40:23] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@neodymium)
[10:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:29] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@neodymium)
[10:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:01] indeed
[10:41:06] _joe_: OT we should really avoid conftool printing [Errno 2] No such file or directory: '/etc/conftool/etcdrc' if there is already another config file that it reads
[10:41:30] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium)
[10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:32] <_joe_> volans: I know :)
[10:45:09] Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (Krenair) {T202782} for einsteinium, not sure about tegmen
[10:45:23] Operations, Gerrit, Traffic, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Nikerabbit) >>! In T191183#4098656, @demon wrote: > `avatars-gravatar` I am not installing. I did once, and was very quickly told not remove it. While that Wordpress proxy plugin s...
[10:46:31] Operations, monitoring, Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (Krenair) What about tegmen?
[10:47:32] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [10:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:51] running it again as we said we want to run it multiple times [10:48:04] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [10:48:04] <_joe_> volans: I don't see the point, frankly [10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:15] <_joe_> given this is a functionality test? [10:48:22] _joe_: to check that the 2nd run on the same python process work as expected [10:48:29] and there isn't any state or thing that breaks [10:49:03] _joe_: I still have to press enter to actually do it [10:49:37] also we want to run it 3 times tomorrow so... [10:49:49] ah, alex went ahead :D running [10:49:54] !log repooling wdqs2001 catched up on lag - T206423 [10:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:59] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 [10:50:51] <_joe_> volans: we have a bug [10:51:01] <_joe_> we're running the backup twice [10:51:04] I know [10:51:08] already discussed [10:51:09] <_joe_> because we have two maintenance hosts [10:51:28] yeah already talked about it [10:51:40] <_joe_> can't we just turn off mwmaint1001 for good? [10:51:57] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:59] yes, that's the plan [10:52:19] <_joe_> ok, when is this going to happen? 
time's running out :) [10:52:36] I'll go ahead in the meanwhile with the other steps, not a blocker [10:53:00] and a super quick fix is to add batch 1 [10:53:15] after the switchback AFAIK [10:53:33] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@neodymium) [10:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] !log END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) (volans@neodymium) [10:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:23] (03PS9) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [10:55:30] the next message of READ-ONLY period for MW will not be true... to avoid heart attacks ;) [10:56:12] !log START - Cookbook sre.switchdc.mediawiki.02-set-readonly (volans@neodymium) [10:56:12] !log [DRY-RUN] MediaWiki read-only period starts at: 2018-10-09 10:56:12.213026 (volans@neodymium) [10:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:23] !log END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) (volans@neodymium) [10:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] <_joe_> volans: I think you just set ReadOnly... 
[10:57:01] _joe_: no [10:57:11] <_joe_> I saw the key replicating [10:57:31] for eqiad yes [10:57:34] not for codfw [10:57:36] I can edit [10:57:38] tested [10:57:39] <_joe_> oh right [10:57:53] <_joe_> It's etcdmirror printing twice in debug [10:57:56] <_joe_> I fainted [10:58:09] lol, that's your fault (double logging) :-P [10:58:17] I'm sorry for the faint [10:58:32] <_joe_> not your fault [10:58:38] _joe_: according to https://phabricator.wikimedia.org/T201343#4626536 mwmaint1001 is only kept around as a fallback, it could also be removed before hand [10:58:51] (03CR) 10Alexandros Kosiaris: "Yeah I am not arguing on that. The question is more philosophical. In general, what's the point in configuring hosts_allow/hosts_deny if w" [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [10:59:48] !log START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (volans@neodymium) [10:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181009T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:14] !log END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) (volans@neodymium) [11:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:39] _joe_: 2nd faint coming [11:03:05] <_joe_> volans: nah, I removed the damn debug already [11:03:23] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@neodymium) [11:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:29] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@neodymium) [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:19] !log START - Cookbook sre.switchdc.mediawiki.04-switch-traffic (volans@neodymium) [11:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:04] <_joe_> we're testing a switch eqiad => codfw right? [11:05:05] (03PS1) 10Elukey: [WIP] Decommission conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) [11:05:32] _joe_: ofc! 
[11:06:37] I'm around for SWAT, but looks like there's nothing to do :D [11:06:47] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-traffic (exit_code=0) (volans@neodymium) [11:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:53] !log START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (volans@neodymium) [11:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:55] !log END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) (volans@neodymium) [11:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:34] !log START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (volans@neodymium) [11:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:35] <_joe_> logs looking good [11:08:37] !log END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) (volans@neodymium) [11:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:03] !log START - Cookbook sre.switchdc.mediawiki.07-set-readwrite (volans@neodymium) [11:11:05] !log [DRY-RUN] MediaWiki read-only period ends at: 2018-10-09 11:11:05.042622 (volans@neodymium) [11:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:05] !log END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) (volans@neodymium) [11:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:21] !log START - Cookbook sre.switchdc.mediawiki.08-restore-ttl (volans@neodymium) [11:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) (volans@neodymium) [11:11:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:42] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12833/" [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:11:49] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@neodymium) [11:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:11] (03CR) 10Elukey: "First draft, waiting for -1s :D" [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:12:19] (03CR) 10Alexandros Kosiaris: [C: 031] "Quick test with parsoid says NOOP on deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [11:12:59] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) (volans@neodymium) [11:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:55] !log START - Cookbook sre.switchdc.mediawiki.08-update-tendril (volans@neodymium) [11:13:56] <_joe_> akosiaris: I think we could make the whole thing less confusing [11:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) (volans@neodymium) [11:14:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'd rather add a simple confd::file to fetch data for the canaries in the profile that declares all the dsh groups than doing all this if " [puppet] - 10https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [11:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:47] !log live-test of the inverted switchdc (eqiad->codfw) completed, all good - T203777 [11:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:14:51] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [11:15:33] (03CR) 10Muehlenhoff: "Most of the rsyncd services have set up hosts_allow patterns before we enabled Ferm everywhere... But using both is essentially free and t" [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [11:25:10] _joe_: you mean vary the hiera lookup based on the dc ? [11:25:31] that would need to duplicate quite a few parts of the scap::dsh::group stuff [11:25:33] <_joe_> akosiaris: no I said a confd::file, so your template is mostly good [11:25:55] <_joe_> not much, you could special-case a lot of things in the template [11:26:15] well the confd::file would still needs keys (which is code that generates them) [11:26:21] and the template's name [11:26:50] there would be quite a bit of repetition around [11:27:01] either in puppet or in the confd template [11:27:12] <_joe_> you have two options for generating the keys: 1 - hardcode them in the puppet manifest 2 - extract that part of scap::dsh::group to a puppet function [11:27:20] <_joe_> I mean a function written in the puppet language [11:27:25] yeah 2 is not happening :P [11:27:31] <_joe_> not a parser function [11:27:45] <_joe_> puppet now has its own native functions, for data transformation [11:28:08] even so, how is that not duplicating functionality ? [11:28:35] having 2 different code paths for pretty much the same thing [11:28:48] <_joe_> well, let me try for a minute, if I can't get something decent I'll remove my -1 [11:28:50] with the logical branch being just if $canary [11:29:03] does sound quite confusing [11:36:43] (03CR) 10Gehel: [C: 04-1] "This mostly look good. Can you explain all the differences in https://puppet-compiler.wmflabs.org/compiler1002/12835/ ?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [11:37:22] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (volans@neodymium) [11:37:22] !log END (ERROR) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=2) (volans@neodymium) [11:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:38] !log dry-run services switchover from codfw to eqiad in preparation for Thursday [11:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:59] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (volans@neodymium) [11:47:00] !log END (ERROR) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=2) (volans@neodymium) [11:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:06] (03CR) 10Alexandros Kosiaris: [C: 031] ">Most of the rsyncd services have set up hosts_allow patterns before we enabled Ferm everywhere..." 
[puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [12:05:22] (03PS2) 10Ema: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [12:11:31] (03CR) 10Muehlenhoff: ">Then again it probably makes more sense to add the defense in depth via a different mechanism in the sensitive rsync daemons like a passw" [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [12:14:03] (03CR) 10Filippo Giunchedi: debian: ship systemd service (031 comment) [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [12:19:12] (03PS1) 10Giuseppe Lavagetto: [WiP] Alternate approach to defining the MW canaries dynamically [puppet] - 10https://gerrit.wikimedia.org/r/465411 [12:19:57] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Alternate approach to defining the MW canaries dynamically [puppet] - 10https://gerrit.wikimedia.org/r/465411 (owner: 10Giuseppe Lavagetto) [12:20:06] <_joe_> akosiaris: ^^ just a stub, lmk what you think [12:20:16] <_joe_> either way it's ugly-ish [12:25:55] (03CR) 10Muehlenhoff: debian: ship systemd service (031 comment) [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [12:26:00] (03PS2) 10Ema: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [12:26:55] (03CR) 10Mathew.onipe: "> Patch Set 9: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:31:25] (03PS1) 10Muehlenhoff: Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/465413 [12:31:33] PROBLEM - High lag on wdqs2001 is CRITICAL: 3636 ge 3600 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:31:42] (03CR) 10Gilles: WIP: define haproxy service for thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [12:32:41] (03CR) 10Elukey: [C: 031] "This account looks familiar, I think I trust it enough, good for me :)" [puppet] - 10https://gerrit.wikimedia.org/r/465413 (owner: 10Muehlenhoff) [12:32:43] (03PS1) 10Filippo Giunchedi: debian: add patch for inline udp usage [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465414 (https://phabricator.wikimedia.org/T205870) [12:33:39] (03PS1) 10ArielGlenn: fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) [12:45:33] !log installing imagemagick security updates [12:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:43] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 211 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:52:13] PROBLEM - High lag on wdqs2001 is CRITICAL: 4518 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:53:53] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 7.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [12:54:29] !log silencing wdqs-public lag alerts (service still functional, and SLO unclear) - T199228 [12:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:34] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 [12:55:42] onimisionipe / SMalyshev: ^^^ [13:06:00] (03PS2) 10KartikMistry: apertium-apy: Set locale to 
UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/464482 [13:06:04] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:08:38] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 7.001 ge 4 Muehlenhoff T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [13:09:09] (03PS1) 10Bmansurov: Stop collecting data CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) [13:09:23] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [13:09:49] (03CR) 10Bmansurov: [C: 04-1] "Deploy on 10/29." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [13:10:24] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 575 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:13:08] !log rebooting prometheus2003 for kernel security update [13:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] (03PS1) 10Ema: htcppurger: move varnish::packages requirement to cache profile [puppet] - 10https://gerrit.wikimedia.org/r/465420 (https://phabricator.wikimedia.org/T204208) [13:17:10] (03PS1) 10Ema: ATS: HTCP purging [puppet] - 10https://gerrit.wikimedia.org/r/465421 (https://phabricator.wikimedia.org/T204208) [13:23:43] (03PS3) 10Jcrespo: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) [13:24:20] (03PS2) 10Ema: ATS: add HTCP-based HTTP purging [puppet] - 10https://gerrit.wikimedia.org/r/465421 (https://phabricator.wikimedia.org/T204208) [13:27:59] 10Operations, 10monitoring, 10Patch-For-Review: 
Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [13:28:45] !log rebooting prometheus2004 for kernel security update [13:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:24] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:35:15] (03CR) 10Gehel: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:37:18] (03CR) 10Gehel: [C: 04-1] "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:38:02] (03PS41) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [13:41:29] !log rebooting prometheus1003 for kernel security update [13:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:19] (03CR) 10Imarlier: [C: 031] "Rockin'." [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:43:41] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) >>! In T191183#4651560, @Nikerabbit wrote: >>>! In T191183#4098656, @demon wrote: >> `avatars-gravatar` I am not installing. I did once, and was very quickly told not remov... 
[13:44:29] (03PS1) 10Giuseppe Lavagetto: New release 1.0.3 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/465425 [13:47:16] (03CR) 10Giuseppe Lavagetto: [C: 032] New release 1.0.3 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/465425 (owner: 10Giuseppe Lavagetto) [13:54:22] !log rebooting prometheus1004 for kernel security update [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:17] (03CR) 10Ottomata: [C: 031] Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:06:00] (03PS42) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:07:35] (03PS1) 10Filippo Giunchedi: WIP: statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [14:09:57] (03CR) 10Ema: [C: 032] htcppurger: move varnish::packages requirement to cache profile [puppet] - 10https://gerrit.wikimedia.org/r/465420 (https://phabricator.wikimedia.org/T204208) (owner: 10Ema) [14:13:39] 10Operations, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Ottomata) Ahm, afaict this is very different and I am likely very ignorant here... buuuuut just in case you don't know about [[ https://github.com/wikimedia/cergen |... [14:22:28] 10Operations, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Krenair) It looks like that has some code involving interacting with X509 certs, but not ACME APIs or the Puppet fileserver API. It seems to have something to do wit... 
[14:22:36] (03CR) 10Ottomata: "I think you'll also want a hieradata/role/common/kafka/logging.yaml to configure profile::kafka::broker for your cluster. You can probabl" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [14:29:29] (03PS10) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [14:31:02] (03PS2) 10ArielGlenn: fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) [14:31:52] (03PS43) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:38:32] (03CR) 10Ottomata: "We should move this away from an-coord1001. It may be necessary for non hdfs users to read hive-site.xml in HDFS. This also goes for the" [puppet] - 10https://gerrit.wikimedia.org/r/465381 (owner: 10Elukey) [14:39:34] (03PS1) 10Muehlenhoff: Remove tweaks to use Linux 4.14 on backup2001 [puppet] - 10https://gerrit.wikimedia.org/r/465434 (https://phabricator.wikimedia.org/T196477) [14:41:34] 10Operations, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Ottomata) Yeah, its mostly for new certificate generation from CAs. Puppet CA is optional; a Letsencrypt (or ACME?) Certificate Signer class could be implemented.... [14:43:42] 10Operations, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Krenair) >>! In T194962#4652052, @Ottomata wrote: > Just saying it should be considered! Maybe if this had been suggested 3-5 months ago this could be considered. T... 
[14:44:06] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:44:37] (03CR) 10Mathew.onipe: "dry run:" [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:45:19] (03PS44) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:47:25] (03CR) 10Ema: [C: 032] ATS: add HTCP-based HTTP purging [puppet] - 10https://gerrit.wikimedia.org/r/465421 (https://phabricator.wikimedia.org/T204208) (owner: 10Ema) [14:50:23] (03PS3) 10Muehlenhoff: Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) [14:52:12] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [14:59:33] (03PS45) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [15:00:27] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) [15:05:58] (03PS1) 10Giuseppe Lavagetto: puppet: update puppet-lint, wmf style checker versions [puppet] - 10https://gerrit.wikimedia.org/r/465439 [15:09:50] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: update puppet-lint, wmf style checker versions [puppet] - 10https://gerrit.wikimedia.org/r/465439 (owner: 10Giuseppe Lavagetto) [15:09:59] (03PS2) 10Giuseppe Lavagetto: puppet: update puppet-lint, wmf style checker versions [puppet] - 10https://gerrit.wikimedia.org/r/465439 [15:11:57] (03PS1) 10Muehlenhoff: Absent the diamond service when removing Diamond [puppet] - 
10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) [15:16:24] 10Operations, 10SRE-Access-Requests: Google Search Console access request - https://phabricator.wikimedia.org/T206544 (10Imarlier) [15:16:38] (03PS2) 10Giuseppe Lavagetto: [WiP] Alternate approach to defining the MW canaries dynamically [puppet] - 10https://gerrit.wikimedia.org/r/465411 [15:16:42] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10herron) IMO removing `ping` is worth a shot, but I also wouldn't be surprised to find the timeout on `100-Continue` was a side-effect of another issue. In addition to the `100-Continue` errors, I'm als... [15:22:07] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) Hi @jkim_wikimedia We will need a few things to follow-up with your access request. Could you please: - add a short justification (just a few words) what t... [15:22:36] (03PS1) 10Muehlenhoff: Remove Diamond from Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) [15:23:22] (03CR) 10jerkins-bot: [V: 04-1] Remove Diamond from Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:24:44] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10AndyRussG) >>! In T206478#4650162, @Krenair wrote: > pgehres hasn't worked here for years He left a database, named in his honour . 
:) [15:25:00] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [15:25:27] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [15:26:30] (03PS2) 10Muehlenhoff: Remove Diamond from Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) [15:26:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:26:39] (03PS46) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:31:11] AndyRussG: o/ - still no news for banner data? :) [15:31:26] elukey: aaaaaargh [15:31:28] :) [15:31:39] apologies [15:32:19] elukey: we have a meeting today at 12:30 Pacific Time, I should have an answer by then the latest [15:32:21] apologies agai [15:32:23] again [15:32:41] thanks! [15:33:24] elukey: thank u!!! 
:) [15:34:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:36:39] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:04] ^ Papaul is swapping the NIC [15:42:05] (03CR) 10Filippo Giunchedi: Absent the diamond service when removing Diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:42:24] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:42:37] (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:42:58] (03CR) 10Elukey: [C: 031] Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:43:22] (03CR) 10Filippo Giunchedi: WIP: define haproxy service for thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [15:43:54] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from Hadoop systems [puppet] - 10https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:44:13] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Hi @Dzahn Thanks for helping with this! - Access is needed for me to track stats for fundraising emails - @CaitVirtue could you approve? - Co... 
[15:45:08] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 687 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:46:06] (03CR) 10Cwhite: [C: 031] Remove Diamond from Hadoop systems [puppet] - 10https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:46:51] (03PS47) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:47:25] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Krenair) >>! In T206478#4652223, @jkim_wikimedia wrote: > - Could you advise on how to create a SSH keypair... take a look at `man ssh-keygen` - you should end up... [15:47:30] (03PS2) 10Filippo Giunchedi: logstash: add ipv6 to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) [15:47:32] (03PS2) 10Filippo Giunchedi: logstash: move to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) [15:47:34] (03PS5) 10Filippo Giunchedi: New Kafka cluster logging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [15:47:36] (03PS6) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [15:48:25] (03CR) 10Cwhite: [C: 04-1] Absent the diamond service when removing Diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:49:06] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:49:08] godog: i forget: are you 
planning on having only a single cluster in eqiad? [15:49:44] ottomata: one per eqiad and codfw, though for the latter the hw isn't there yet [15:50:06] (03CR) 10Ottomata: "We'd need to import the new .deb package from confluent into our apt repo, but the rest should just work as is." [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:50:46] (03PS48) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:51:07] ok, godog i'm pretty sure if you put https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465167/6/hieradata/role/eqiad/logstash/elasticsearch.yaml into role/common instead, and leave off the -eqiad suffix, things will just work [15:51:16] and you won't need DC specific configs when you do get the codfw cluster online [15:51:32] e.g. how main works with [15:51:33] profile::kafka::broker::kafka_cluster_name: main [15:51:52] (03CR) 10Cwhite: [C: 031] debian: add patch for inline udp usage [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465414 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [15:53:19] ottomata: ah! 
interesting, thanks I didn't realize that was the case [15:53:27] even better [15:53:35] see kafka_cluster_name.rb [15:54:12] yeah that makes sense, I grepped kafka_cluster_name in hieradata and some things like profile::cache::kafka::eventlogging::kafka_cluster_name use the full name, hence the confusion [15:54:19] yeah [15:54:24] sometimes you need both [15:54:28] depends on what you are trying to target [15:55:28] PROBLEM - Host backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:58] (03PS7) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [15:56:28] fixed ^ [15:57:14] (03CR) 10Cwhite: [C: 031] debian: update changelog [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465352 (owner: 10Filippo Giunchedi) [15:57:26] (03CR) 10Ottomata: [C: 031] site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:57:47] (03CR) 10Alex Monk: [C: 04-1] "also set profile::certcentral::server::config challenges stuff in prod hiera" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [15:57:51] (03CR) 10Herron: "Should we also update the elasticsearch network.publish_host and/or network.host settings to reflect ipv6 at this time? If not, maybe the" [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:58:45] (03CR) 10Ottomata: "> Is there anything we depend on from the new version?" 
[puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:58:47] (03CR) 10Cwhite: [C: 031] debian: use standard rules for Prometheus packages [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465351 (owner: 10Filippo Giunchedi) [15:58:55] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10CaitVirtue) Approved [15:59:12] (03PS2) 10Cwhite: icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181009T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:02:06] (03PS49) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:05:56] (03CR) 10Cwhite: debian: ship systemd service (031 comment) [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [16:06:09] RECOVERY - Host backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.90 ms [16:08:30] 10Operations, 10SRE-Access-Requests: Google Search Console access request - https://phabricator.wikimedia.org/T206544 (10Dzahn) a:03Dzahn [16:11:00] (03PS50) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:15:36] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff the new NIC is in place [16:17:51] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - 
https://phabricator.wikimedia.org/T206191 (10Papaul) [16:24:17] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Cmjohnson) I replaced the old raid card with the new one [16:27:52] (03PS1) 10Papaul: DNS: Add production and mgmt DNS entries for db2096 [dns] - 10https://gerrit.wikimedia.org/r/465449 (https://phabricator.wikimedia.org/T206191) [16:28:09] elukey: hey... just added a question to the task, which would probably be useful for dstrine and Seddon to give an opinion: if you build the realtime job based on EventLogging (currently only at 1% for everything) how long will it be before another job comes along and backfills with the full data from the old stream? [16:28:14] Thx much and apologies again!!! [16:28:40] 10Operations, 10SRE-Access-Requests: Google Search Console access request - https://phabricator.wikimedia.org/T206544 (10Dzahn) Done. I added imarlier@wikimedia to the domains listed above. I used the default level of access called "full" (as opposed to "restricted" or "owner"). [16:29:17] i.e., for about how long would the 1% sampled data from the new stream remain in Druid? 
[16:29:28] 10Operations, 10SRE-Access-Requests: Google Search Console access request - https://phabricator.wikimedia.org/T206544 (10Dzahn) 05Open>03Resolved [16:29:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [16:30:05] AndyRussG: so the data retention needs to follow our guidelines, need to check with my team but usually we keep things for a week in druid [16:30:12] (not really sure about this point though) [16:30:20] I didn't get the point about the backfill though [16:36:21] (03PS1) 10Mathew.onipe: base::monitoring::host: added prometheus check for network receive drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [16:38:54] (03PS5) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [16:39:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) Server is racked in B4 switch port information : asw-b4-codfw ge-4/0/0 IP address: 10.192.16.34 [16:41:25] (03CR) 10jerkins-bot: [V: 04-1] base::monitoring::host: added prometheus check for network receive drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [16:43:21] (03CR) 10Cwhite: [C: 032] profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:45:22] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) ``` papaul@asw-b-codfw> show interfaces ge-4/0/0 descriptions Interface Admin Link Description ge-4/0/0 up up db2096 ``` ``` in... 
[16:46:20] _joe_: would you mind lending a hand with the puppet merge? [16:47:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [16:48:17] elukey: there is (or there would be) a realtime job, but there's also the current hourly job that uses the old pipeline, no? [16:48:57] AndyRussG: There is yes, and IIRC that one is still working [16:49:26] elukey: right [16:49:32] so that one would be left in place, no? [16:49:48] https://turnilo.wikimedia.org/#banner_activity_minutely/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ACgqwATJXlUg7MGuiy4CVgIwARSy0dQnRiKHomcOJNfFIfAF8AXSTIzSQ0R2cNbW4LJjsINmwoT1NzQv0QAHN3bBgEWggNbisRHwAJSShMOnxQQ2MCMzyXCENyDBxuOChyYmwC7CqQeKYkFib8OoQEFJA2GbcwvpBoAFl6jHwpRChiSIglhGDDAEcYMLQTckwYbAj9pcICoQG8PuFvr9/iBkkxNI8SHY/ [16:49:50] BVip5Mok4QjiHYAMo9egEMGfSF/AHPKrzOxbeoIJiUCBVShIRkE7a7IA [16:49:54] blrg [16:50:44] currently I see data up to 0 hrs UTC on October 8th [16:50:54] Sorry October 9th [16:50:58] So I guess now it's daily? [16:50:58] yep [16:51:03] that's what I meant by backfill [16:51:50] The realtime data from the higher-sampled new pipeline would be overwritten by daily or hourly less-sampled data from the old pipeline [16:51:51] (03CR) 10Dzahn: [C: 032] DNS: Add production and mgmt DNS entries for db2096 [dns] - 10https://gerrit.wikimedia.org/r/465449 (https://phabricator.wikimedia.org/T206191) (owner: 10Papaul) [16:51:52] correct? [16:52:24] we can have two separate datasets, one that keeps using the "old" pipeline and one the new EL schema [16:52:52] elukey: hmmm [16:53:04] ok I think that works [16:53:12] currently what's the delay with the old pipeline? 
[16:53:50] well we process webrequest once a day to populate banner data [16:54:10] but in theory we could get to have an hourly update [16:54:19] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:54:22] ah hmm yeah I think that's what there used to be [16:54:29] heck of a URL [16:54:42] (brb sorry) [16:54:46] Krenair: the blrg at the end was frustration not part of the url [16:54:56] :D [16:55:22] my CPU did overheat just from pasting it [16:56:03] lol [16:56:17] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10colewhite) @Krenair I believe tegmen will be upgraded once we are on icinga1001. [17:00:15] shdubsh _joe_ looks like there are pending changes to puppet-merge [17:01:56] herron: yes. afaict, mine are blocked by _joe_'s [17:04:40] I think he’s signed off already for the day [17:05:04] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) p:05Normal>03High Happened again, bumping the priority. [17:05:31] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) Remaining tasks: - fix fonts @mobrovac is going to provision new server, where we can run `fonts-config` once again. We will check t... 
[17:10:05] (03PS1) 10BBlack: network::constants: fix druid ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/465453 [17:12:24] (03PS1) 10BBlack: XXX note bad entries for conf200x in network::constants [puppet] - 10https://gerrit.wikimedia.org/r/465455 [17:12:52] (03PS1) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 [17:13:48] (03CR) 10jerkins-bot: [V: 04-1] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (owner: 10Cwhite) [17:14:44] (03PS2) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [17:15:28] (03Abandoned) 10Cwhite: Revert "hiera: comment out diamond::remove" [puppet] - 10https://gerrit.wikimedia.org/r/464867 (owner: 10Cwhite) [17:17:00] (03CR) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:18:33] (03PS3) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [17:18:53] 10Operations, 10Release Pipeline, 10Epic, 10Release-Engineering-Team (Kanban), 10Services (watching): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [17:21:33] !log depooled wdq23 again, sigh [17:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 56.44 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:26:16] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Cmjohnson) @Marostegui disk swapped [17:28:12] 
(03PS1) 10Ottomata: Set hive_server_url for refine job to value that hive client class uses [puppet] - 10https://gerrit.wikimedia.org/r/465458 (https://phabricator.wikimedia.org/T205509) [17:28:38] (03CR) 10Elukey: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/465453 (owner: 10BBlack) [17:31:19] 10Operations, 10ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T206345 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson raid looks to be back after disk swap ...resolving cmjohnson@db1064:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware... [17:32:15] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) The battery was sent to our old office address in San Francisco, they are shipping a new battery...because it's a battery it has to go ground and will... [17:32:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 55.18 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:33:16] shdubsh: hello! Just got in your position :D [17:33:47] let's see the pending merges, should be harmless [17:34:39] yeah looks good to merge [17:34:51] shdubsh: shall I merge? [17:34:57] elukey: I agree, they look to be harmless, but _joe_ didn't like it when I merged his changes last time. [17:35:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Cmjohnson) @Bstorm the disk has been swapped...resolve once it's back to normal please [17:35:14] I am going to take the full blame if anything happens [17:35:41] merged [17:36:04] cool, thanks! 
[17:36:06] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey I am in conversation with DELL about the server, getting them the info they need.....nothing has been decided yet but as soon as they tell me what they're sending (should be a... [17:36:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [17:36:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 72.7 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:37:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 75.21 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:37:27] (03CR) 10Herron: [C: 031] "Looks good! Please see comment below re: sequence of manual steps." [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [17:38:50] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) it is a new disk...trying it again [17:39:15] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) new disk...trying it again [17:42:22] (03CR) 10Elukey: [C: 04-1] "This reverts a change in cdh :)" [puppet] - 10https://gerrit.wikimedia.org/r/465458 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [17:42:38] (03PS2) 10Muehlenhoff: Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) [17:44:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:45:30] (03PS2) 10Elukey: Set hive_server_url for refine job to value that hive client class uses [puppet] - 10https://gerrit.wikimedia.org/r/465458 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [17:45:38] RECOVERY - Device not healthy -SMART- on db1072 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [17:45:56] (03CR) 10Ottomata: [V: 032 C: 032] Set hive_server_url for refine job to value that hive client class uses [puppet] - 10https://gerrit.wikimedia.org/r/465458 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [17:47:48] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) Wait, the NAT is only applied to connections in the egress path (from VM to the internet), and connections internal to CloudVPS are... [17:49:58] (03CR) 10Muehlenhoff: [C: 04-1] ntp: move diamond::collector to where it will only apply to ntp servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:54:43] (03CR) 10Herron: site: enable logging Kafka on Logstash nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [17:55:07] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) Pasting the (trimmed down) IRC discussion to this task to keep everyone in the loop: ``` there is no 830 replacement any longer for the 840 controller t... 
[17:59:38] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:59:40] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206558 [17:59:44] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206558 (10ops-monitoring-bot) [18:01:14] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206558 (10Marostegui) [18:01:16] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) [18:01:36] (03PS1) 10Ottomata: Refine job: Fix hive_server_url, add since and until params [puppet] - 10https://gerrit.wikimedia.org/r/465464 [18:04:34] (03CR) 10Ottomata: [C: 032] Refine job: Fix hive_server_url, add since and until params [puppet] - 10https://gerrit.wikimedia.org/r/465464 (owner: 10Ottomata) [18:06:39] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) >>! In T206261#4652649, @aborrero wrote: > Wait, the NAT is only applied to connections in the egress path (from VM to the internet),... [18:06:59] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:15:38] ACKNOWLEDGEMENT - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Stas Malychev Trying to switch to RC updater to see if it works better [18:15:52] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) Failed: ``` PD: 1 Information Enclosure Device ID: 32 Slot Number: 7 Drive's position: DiskGroup: 0, Span: 1, Arm: 1 Enclosure position: 1 Device Id: 7 WWN: 5000C50070CACB6C Sequence Number: 2... 
[18:18:19] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [18:19:46] (03PS1) 10Papaul: DHCP: Add MAC address entry for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/465466 (https://phabricator.wikimedia.org/T206191) [18:21:10] (03PS2) 10Ottomata: Add Accept header to varnishkafka webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) [18:23:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (10Smalyshev) So I tried to run it with RC updater, and it seems to be catching up much faster than with Kafk... [18:24:02] (03CR) 10Ottomata: [C: 032] Add Accept header to varnishkafka webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) (owner: 10Ottomata) [18:24:24] !log adding Accept header to all varnishkafka generated webrequest logs [18:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:29] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) Ok, it seems I'm wrong, will have to review my own docs lol So I will have to investigate how to better configure this, either what... [18:28:50] (03CR) 10Cwhite: Stop the diamond service when removing Diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:30:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [18:30:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) Just added the `accept` field to the varnishkafka generated webrequest logs. @JAllemandou I haven't done this in a while, I'll ping you in m... [18:32:50] (03CR) 10Cwhite: "> Note that there are few cases where memcached is present (the" [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [18:37:41] (03PS3) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [18:38:18] 10Operations, 10netops: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10ayounsi) I took the 16 hosts unable to reach the eqsin anchor over v6 during the last measurement (https://atlas.ripe.net/measurements/11645088/) and ran traceroutes from them to the e... 
[18:38:44] (03PS4) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [18:39:42] (03PS2) 10Dzahn: DHCP: Add MAC address entry for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/465466 (https://phabricator.wikimedia.org/T206191) (owner: 10Papaul) [18:39:51] (03PS5) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [18:39:58] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:40:20] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address entry for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/465466 (https://phabricator.wikimedia.org/T206191) (owner: 10Papaul) [18:41:19] RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:47:19] PROBLEM - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [18:48:40] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:48:49] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) einsteinium is already out of warranty, which meant it was included in the original hardware refresh ticket/goal that this is part of. tegmen is stil... 
[18:53:35] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen (icinga2001) - https://phabricator.wikimedia.org/T206563 (10Dzahn) [18:55:15] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) requested new hardware to replace tegmen in T206563 [18:56:33] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen (icinga2001) - https://phabricator.wikimedia.org/T206563 (10Dzahn) [18:56:39] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [18:58:39] (03PS1) 10Ottomata: Use --verbose flag to is-yarn-application-running for spark crons [puppet] - 10https://gerrit.wikimedia.org/r/465473 (https://phabricator.wikimedia.org/T206555) [19:00:30] (03CR) 10Ottomata: [C: 031] site: enable logging Kafka on Logstash nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [19:01:11] (03CR) 10Ottomata: [C: 032] Use --verbose flag to is-yarn-application-running for spark crons [puppet] - 10https://gerrit.wikimedia.org/r/465473 (https://phabricator.wikimedia.org/T206555) (owner: 10Ottomata) [19:01:18] (03PS2) 10Ottomata: Use --verbose flag to is-yarn-application-running for spark crons [puppet] - 10https://gerrit.wikimedia.org/r/465473 (https://phabricator.wikimedia.org/T206555) [19:01:20] (03CR) 10Ottomata: [V: 032 C: 032] Use --verbose flag to is-yarn-application-running for spark crons [puppet] - 10https://gerrit.wikimedia.org/r/465473 (https://phabricator.wikimedia.org/T206555) (owner: 10Ottomata) [19:05:24] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10dungodung) [19:07:48] (03PS1) 10Ayounsi: Icinga: increase 
ripe atlas alerting threshold to 35 [puppet] - 10https://gerrit.wikimedia.org/r/465476 (https://phabricator.wikimedia.org/T205829) [19:12:02] (03CR) 10Gehel: [C: 031] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [19:17:48] jouncebot: next [19:17:48] In 3 hour(s) and 42 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181009T2300) [19:17:55] jouncebot: current [19:18:00] jouncebot: now [19:18:00] No deployments scheduled for the next 3 hour(s) and 41 minute(s) [19:18:52] *** We're going to be doing some network stuff that's intended to be non-disruptive [19:19:03] *** But please, hold off on other changes for a few, for clarity's sake [19:20:53] !log bounce igmp-snooping on asw2-b-eqiad [19:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:43] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen (icinga2001) - https://phabricator.wikimedia.org/T206563 (10Dzahn) [19:25:02] !log disable igmp-snooping on asw2-b-eqiad - T201039 [19:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:07] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [19:35:23] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10dungodung) [19:37:12] !log disable igmp-snooping on asw2-c-eqiad - T201039 [19:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:16] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [19:41:15] *** Done with scary network things, resume normal stuff :) [19:51:12] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen 
(icinga2001) - https://phabricator.wikimedia.org/T206563 (10Peachey88) [20:00:19] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) a:05DarTar>03Nuria [20:01:28] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Nuria) Approved [20:04:06] (03PS1) 10Dzahn: add wikistats.org, wikitrends.org as parked domains [dns] - 10https://gerrit.wikimedia.org/r/465483 [20:04:28] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdg1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdg1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [20:06:30] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Nuria) @Tbayer, you do not need any special permissions to access any type of data, the datasources that were accessible through these permits have sinc... 
[20:08:32] !log repair /dev/sdg1 on ms-be2041 - T199198 [20:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:36] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [20:09:51] (03CR) 10Dzahn: [C: 032] add wikistats.org, wikitrends.org as parked domains [dns] - 10https://gerrit.wikimedia.org/r/465483 (owner: 10Dzahn) [20:11:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [20:11:08] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [20:13:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% [20:13:48] XioNoX: ^^^ [20:13:52] FB again? [20:14:55] most likely... [20:15:03] bblack: ^ [20:15:46] possibly! [20:16:04] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) a:05Nuria>03Dzahn [20:16:09] I don't know what we can do about it except repool eqiad edge again [20:16:11] (03PS2) 10Dzahn: admin: add isaacj to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) (owner: 10Herron) [20:16:25] or block FB entirely on the excess transfers [20:16:31] what in the heck are they doing [20:16:43] they're pulling tons of image data from cache_upload, if same as before [20:17:16] would it help if we call Ori? [20:17:31] domas you mean? [20:17:35] but I doubt it [20:18:44] maybe send an email to engineering@wmf to see who has contacts at FB, I recall we were talking to them about bc they were going to parse dumps for some integrations on their side [20:18:51] is it easy to block them? 
let's do that and maybe then they contact us soon enough [20:19:01] https://phabricator.wikimedia.org/T192688#4650445 [20:19:01] wondering if this isn't outcome from whatever internal product they are working on prefetching things [20:19:20] yeah probably [20:19:21] I know someone at FB as some sort of SRE type thing. But he's AFK for the day now (he's in the UK)... Can poke him tomorrow [20:19:32] they fetch tons and tons of thumbnails [20:19:46] I've friends too there fwiw [20:19:56] they do HEAD checks first as if they're looking to deduplicate or use existing cached data on their side, but still the thumbs pull volume is massive in terms of bytes output by us [20:20:23] etcd testing was done, so IMHO it's not a big deal to flip eqiad edge back on again early now (again) [20:20:47] I'm not sure who it was WMF side that had a convo w/ them about consuming our data but iirc VC was aware [20:21:09] yes, and we're due for some more/better convos going forward, but those will take time [20:21:52] yeppers [20:22:39] anyways, I'm going to repool eqiad edge for now, because it's the easiest thing to do in terms of protecting most of the users from seeing effects of the saturation (they might still saturate a codfw port regardless, but fewer humans will be sharing it) [20:22:56] if it's still a problem we can look at temporarily blocking FB [20:23:33] blocking them entirely seems justified [20:23:33] bblack: can we only send Facebook to ulsfo? [20:23:57] yeah we could, but it's about the same thing [20:24:21] can't we start returning HTTP 429 and force the people responsible for the traffic to get in touch? [20:24:22] kinda [20:24:22] well, ulsfo is depooled, so we don't care if they saturate it [20:24:46] isn't that supposed to be temporary though? [20:24:46] it's not, last I looked!
[20:24:59] ah, I thought it was still [20:25:05] nevermind then :) [20:25:06] but yes, we could re-depool ulsfo for everyone but them, but that's complicated [20:25:11] bblack: do you want me to try poking my friend there? [20:26:18] (03PS1) 10BBlack: Revert "Revert "Revert "traffic: Depool eqiad from user traffic for switchover""" [dns] - 10https://gerrit.wikimedia.org/r/465512 [20:26:43] (03CR) 10BBlack: [C: 032] Revert "Revert "Revert "traffic: Depool eqiad from user traffic for switchover""" [dns] - 10https://gerrit.wikimedia.org/r/465512 (owner: 10BBlack) [20:27:11] seriously [20:27:32] I don't think it's necessary. I mean, it can't hurt, but ultimately we're going to deal with this through official channels rather than un-official, and with a different sort of solution. [20:28:35] why can't the official channel be you telling their NOC to either put someone responsible for the traffic in contact, or block the traffic from leaving their network? [20:28:49] and if neither, block their requests [20:28:51] because :P [20:29:56] I don't recall, do we already have official channels open for this? [20:30:27] apparently someone spoke to their network team [20:30:36] yes, we do [20:35:18] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on einsteinium is CRITICAL: 59.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:36:04] ... okay that's probably a good thing in the context? 
[20:36:06] ^ that's requests draining off codfw [20:36:08] yes [20:36:36] (03CR) 10Dzahn: [C: 032] "approved by Dario and Nuria" [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) (owner: 10Herron) [20:37:14] re: making one of our DCs exclusively for FB, temporarily: it's a tricky config change that gets to be a PITA to revert or mix around others, but we can do it [20:39:03] (03CR) 10Ayounsi: [C: 032] Icinga: increase ripe atlas alerting threshold to 35 [puppet] - 10https://gerrit.wikimedia.org/r/465476 (https://phabricator.wikimedia.org/T205829) (owner: 10Ayounsi) [20:39:13] (03PS2) 10Ayounsi: Icinga: increase ripe atlas alerting threshold to 35 [puppet] - 10https://gerrit.wikimedia.org/r/465476 (https://phabricator.wikimedia.org/T205829) [20:39:49] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#4651560, @Nikerabbit wrote: > Really? As far as I know all gravatar integrations send hashes of emails. This means they don't know which email it is, unless it... [20:41:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [20:41:09] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [20:41:41] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) Deployed on stat1006: ``` [stat1006:~] $ id isaacj uid=20171(isaacj) gid=500(wikidev) groups=500(wikidev),714(researchers) ``` on bast1001: ``` [... [20:41:43] is that thing just repeating the existing alert? 
[20:42:00] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) [20:42:41] yes, it repeats every :30 if the condition persists [20:43:05] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) 05Open>03Resolved This is done. Requested access should now work as expected. I can confirm the user has been created on stat1006 and puppet will c... [20:43:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% [20:43:17] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12842/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/465476 (https://phabricator.wikimedia.org/T205829) (owner: 10Ayounsi) [20:45:56] (03PS5) 10Dzahn: mediawiki::web::prod_sites: convert wikinews.org [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [20:51:01] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac) I verified that I can ssh in - thanks all! [20:52:01] anyways, DNS move still apparently leaves them saturating a port, it just impacts fewer users and gives them more bandwidth :P [20:52:26] 10Operations, 10netops, 10Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10ayounsi) 05Open>03Resolved a:03ayounsi That should be good enough to make the alerts useful by removing the "false positive". Please reopen if still too no... 
[20:58:39] (03PS1) 10BBlack: Heavily ratelimit inbound FB requests to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/465517 [20:59:28] (03CR) 10BBlack: [C: 032] Heavily ratelimit inbound FB requests to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/465517 (owner: 10BBlack) [21:03:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80% [21:03:09] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% [21:03:12] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:04:30] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [21:06:42] I don't think the ratelimiter actually caused the recovery, I think their spike of requests was already dropping off before the ratelimiter hit the caches [21:10:00] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on einsteinium is OK: (C)60 le (W)70 le 72.15 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:14:45] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Tbayer) >>! In T178802#4653097, @Nuria wrote: > @Tbayer, you do not need any special permissions to access any type of data, the datasources that were a...
[21:17:50] (03PS1) 10Dzahn: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) [21:18:41] (03CR) 10jerkins-bot: [V: 04-1] base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:19:52] (03PS2) 10Dzahn: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) [21:20:36] (03CR) 10jerkins-bot: [V: 04-1] base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:34:14] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10CRoslof) It sounds like adding a delegation for wikimediaeesti would just allow you to change the nameservers, but we can change the nameservers now without ad... [21:42:20] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 5879 bytes in 4.544 second response time [21:42:39] ^ unusual.. 
looking at that [21:48:21] PROBLEM - Check Varnish expiry mailbox lag on cp1085 is CRITICAL: CRITICAL: expiry mailbox lag is 2009372 [21:50:40] !log releases1001 - restarted jenkins (it went from 200 -> 503 -> 403) curl localhost:8080 works again after restart, icinga check still getting 403 now [21:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:23] !log cp1085: varnish backend restart for mbox lag [21:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:24] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10DStrine) [21:58:30] RECOVERY - Check Varnish expiry mailbox lag on cp1085 is OK: OK: expiry mailbox lag is 0 [22:10:10] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdl1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdl1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [22:15:01] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:18:50] (03CR) 10Ppchelko: [C: 031] "Still +1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443420 (https://phabricator.wikimedia.org/T198220) (owner: 10Giuseppe Lavagetto) [22:20:01] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:22:06] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdl1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdl1 site=codfw cole_white running xfs_repair https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [22:24:51] PROBLEM - swift-object-replicator on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:25:01] PROBLEM - swift-object-updater on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:26:10] !log repairing /dev/sdl1 on ms-be2040 - T199198 [22:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:13] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [22:28:14] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) [22:32:07] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [22:33:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) a:05Papaul>03Marostegui @Marostegui All yours. [22:34:38] PROBLEM - Disk space on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[22:41:46] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) The fully implemented secondary schema is ready for #techcom review: https://gerrit.wikimedia.org... [22:45:37] PROBLEM - configured eth on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:47:09] (03CR) 10Varnent: [C: 031] [GovernanceWiki] Disable wgRawHTML, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462834 (https://phabricator.wikimedia.org/T201285) (owner: 10Jforrester) [22:47:28] PROBLEM - dhclient process on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:49:17] PROBLEM - Check systemd state on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:49:17] PROBLEM - puppet last run on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:51:08] PROBLEM - Check the NTP synchronisation status of timesyncd on db2096 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:52:47] RECOVERY - dhclient process on db2096 is OK: PROCS OK: 0 processes with command name dhclient [22:52:58] RECOVERY - configured eth on db2096 is OK: OK - interfaces up [22:53:08] RECOVERY - Disk space on db2096 is OK: DISK OK [22:53:37] RECOVERY - Check systemd state on db2096 is OK: OK - running: The system is fully operational [22:54:27] RECOVERY - puppet last run on db2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:58:23] !log repooled wdqs2003 [22:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181009T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:10:07] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [23:21:16] RECOVERY - Check the NTP synchronisation status of timesyncd on db2096 is OK: OK: synced at Tue 2018-10-09 23:21:10 UTC. [23:28:09] 10Operations, 10MediaWiki-Revision-deletion, 10MediaWiki-Special-pages, 10Regression: Unable to change visibility of log entries on at least metawiki, outreachwiki and wikimania2018wiki - https://phabricator.wikimedia.org/T205908 (10Base) This might as well be some faulty rewrite or other server misconfigu... [23:34:32] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10thcipriani) >>! In T191183#4649245, @Dereckson wrote: >>>! In T191183#4647075, @Krinkle wrote: >> Gerrit wants 100x100px square thumbnails. > > The 100x100 size isn't what curren...