[00:00:37] PROBLEM - HTTPS-policy on policy.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate policy.wikimedia.org valid until 2018-09-05 23:59:59 +0000 (expires in 29 days)
[00:02:57] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.007195 https://grafana.wikimedia.org/dashboard/db/logstash
[00:02:57] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.0232 https://grafana.wikimedia.org/dashboard/db/logstash
[00:03:27] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.005952 https://grafana.wikimedia.org/dashboard/db/logstash
[00:03:32] (PS1) Jforrester: Cleanup: Drop old comment about zhwiki priv changes from 2010 [mediawiki-config] - https://gerrit.wikimedia.org/r/450886
[00:03:34] (PS1) Jforrester: Cleanup: Drop old comment for a global rollback group that doesn't exist [mediawiki-config] - https://gerrit.wikimedia.org/r/450887
[00:03:36] (PS1) Jforrester: Cleanup: Drop old comment for a global developer group that doesn't exist [mediawiki-config] - https://gerrit.wikimedia.org/r/450888
[00:03:38] (PS1) Jforrester: Cleanup: Drop old comments for general user access to FlaggedRevs on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450889
[00:03:40] (PS1) Jforrester: Cleanup: Drop old comment for khmwikt [sic] import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/450890
[00:09:14] (PS2) Aaron Schulz: Use mcrouter for cache reads on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449604 (https://phabricator.wikimedia.org/T198239)
[00:12:03] (CR) Jforrester: [C: 1] Remove $wgUseImageResize as same as default [mediawiki-config] - https://gerrit.wikimedia.org/r/449615 (owner: Reedy)
[00:17:50] (CR) Aaron Schulz: [C: 2] Use mcrouter for cache reads on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449604 (https://phabricator.wikimedia.org/T198239) (owner: 
Aaron Schulz)
[00:19:08] (Merged) jenkins-bot: Use mcrouter for cache reads on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449604 (https://phabricator.wikimedia.org/T198239) (owner: Aaron Schulz)
[00:29:18] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Use mcrouter for cache reads on all wikis (duration: 00m 49s)
[00:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:33] * AaronSchulz waits for mc spike to finish dropping to normal
[00:37:11] likely due to LRU/eviction differences due to low reads on mcrouter and high reads on nutcracker. meh.
[00:41:15] Cache callbacks will often access cache themselves, so GETs will be increase for a while too.
[00:43:47] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1627 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[00:52:16] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.01922 https://grafana.wikimedia.org/dashboard/db/logstash
[01:05:42] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1084.eqiad.wmnet,service=nginx
[01:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:58] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1072.eqiad.wmnet,service=nginx
[01:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1083.eqiad.wmnet,service=nginx
[01:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:39] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1065.eqiad.wmnet,service=nginx
[01:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:07] * AaronSchulz walks home
[02:23:53] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.15) (duration: 08m 46s)
[02:23:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:25] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Aug 7 02:34:24 UTC 2018 (duration 10m 31s)
[02:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:06] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 39.01, 35.34, 32.23
[03:46:01] !log on mwmaint1001 running populateContentTables.php concurrently on wikidatawiki and commonswiki (T183488)
[03:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:06] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[03:52:07] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 37.30, 31.96, 32.07
[03:59:17] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 36.76, 32.24, 32.06
[04:08:57] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1575 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[04:14:56] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1205 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[04:19:46] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04927 https://grafana.wikimedia.org/dashboard/db/logstash
[04:20:42] (CR) Zhuyifei1999: [C: 1] "Shall I merge + build this?" 
(1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: BryanDavis)
[04:52:11] (PS1) KartikMistry: hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962)
[05:08:39] (CR) jerkins-bot: [V: -1] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:15:08] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/450901
[05:17:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/450901 (owner: Marostegui)
[05:18:35] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/450901 (owner: Marostegui)
[05:21:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 (duration: 00m 50s)
[05:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:26] !log Deploy schema change on db1123 T144010 T51190 T199368
[05:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:32] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[05:21:33] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[05:21:33] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[05:58:27] (CR) Giuseppe Lavagetto: "a few minor style comments but it LGTM." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: Ema)
[06:14:10] Operations, Wikidata: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC - https://phabricator.wikimedia.org/T198049 (tstarling) >>! 
In T198049#4310346, @jcrespo wrote: > 51,715 exceptions with: > > ``` > [{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitErr...
[06:29:27] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:31:06] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:31:17] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mtail/varnishxcps.mtail]
[06:32:47] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local]
[06:41:13] Operations, ops-eqiad, Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (Volans) I think there was an agreement to install this a Stretch and perform this way the upgrade jessie->stretch of this cluster. In... 
[06:42:06] (PS4) Prtksxna: Remove obsolete $wgPopupsBetaFeature from InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/444574
[06:42:08] (PS1) Prtksxna: Remove obsolete $wgPopupsBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/450906
[06:43:02] (CR) Prtksxna: "> Patch Set 3: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/444574 (owner: Prtksxna)
[06:47:06] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/450907
[06:52:52] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/450907 (owner: Marostegui)
[06:53:03] !log reboot lvs secondaries for kernel upgrade
[06:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:15] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/450907 (owner: Marostegui)
[06:55:16] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:55:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 (duration: 00m 50s)
[06:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:57] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:16] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:12] !log Deploy schema change on db1075 (s3 master) T144010 T51190 T199368
[06:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:18] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[06:58:19] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[06:58:19] T199368: Convert UNIQUE INDEX to PK in 
Production - https://phabricator.wikimedia.org/T199368
[06:58:46] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:50] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/450908
[07:00:55] (PS2) Marostegui: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/450908
[07:01:53] Operations, Wikidata: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC - https://phabricator.wikimedia.org/T198049 (jcrespo) I am not too worried about exceptions/error messages, I only pointed those in case it helped debug the real issues, the ones I mentioned at T198049#...
[07:02:42] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/450908 (owner: Marostegui)
[07:03:58] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/450908 (owner: Marostegui)
[07:05:06] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool pc2006 T200641 (duration: 00m 49s)
[07:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:11] T200641: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641
[07:05:31] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) Open>Resolved I have repooled the host so going to consider this resolved as there is not much else we can do - I am going to create a task to get pc2004 and pc2005's BIOS upgrade befo... 
[07:08:27] Operations, ops-codfw, DBA: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (Marostegui)
[07:08:40] Operations, ops-codfw, DBA: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (Marostegui) p:Triage>Normal
[07:12:53] (PS1) Volans: Route puppetboard & debmonitor through cache_text [dns] - https://gerrit.wikimedia.org/r/450909 (https://phabricator.wikimedia.org/T164609)
[07:13:59] (CR) Ema: [C: 1] Route puppetboard & debmonitor through cache_text [dns] - https://gerrit.wikimedia.org/r/450909 (https://phabricator.wikimedia.org/T164609) (owner: Volans)
[07:16:07] (CR) Volans: [C: 2] Route puppetboard & debmonitor through cache_text [dns] - https://gerrit.wikimedia.org/r/450909 (https://phabricator.wikimedia.org/T164609) (owner: Volans)
[07:16:43] Operations, Traffic, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (Joe) Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit; knowing the origin of the 5xxs in broad categories ("public traffic for the sit... 
[07:29:49] !log migrated puppetboard and debmonitor to cache_text - T164609
[07:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:54] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609
[07:32:15] (PS1) Jcrespo: mariadb: Depool es1019 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132)
[07:34:49] (CR) Marostegui: [C: 1] mariadb: Depool es1019 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132) (owner: Jcrespo)
[07:34:57] Operations, ops-eqiad, monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (Volans) I'm ok with `monitor1001` for the naming, no strong opinion though so I'm open also for alternatives.
[07:37:24] !log restarted es1019 prometheus exporter
[07:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:50] !log restarted es1014 prometheus exporter (last message was wrong)
[07:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:18] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 52369 MB (10% inode=99%)
[07:52:28] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[07:58:26] Operations, netops: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (jcrespo) New issue: there seems to be connectivity issues between es1014 (B1) and prometheus1004 (B4), not intermitent, they are unable to ping . ``` root@es1014:/run/mysqld$ ping prometheu...
[07:59:37] Operations, netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (jcrespo) See T201139#4483590, probably more relevant here (diconnection between a B1 and a B4 host). 
[08:03:28] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 39.75, 34.15, 32.50
[08:12:42] (PS2) Jcrespo: mariadb: Depool es1019 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132)
[08:13:22] !log reboot tegmen with a new kernel
[08:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:31] (CR) Jcrespo: [C: 2] mariadb: Depool es1019 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132) (owner: Jcrespo)
[08:17:50] (Merged) jenkins-bot: mariadb: Depool es1019 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132) (owner: Jcrespo)
[08:21:07] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1821 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:24:38] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03982 https://grafana.wikimedia.org/dashboard/db/logstash
[08:25:21] (PS2) Gehel: elasticsearch: migrate codfw cluster to Stretch and RAID0 [puppet] - https://gerrit.wikimedia.org/r/450062 (https://phabricator.wikimedia.org/T193649)
[08:27:15] (PS3) Giuseppe Lavagetto: mediawiki: makes includes explicit in private-https.conf [puppet] - https://gerrit.wikimedia.org/r/450585
[08:27:17] (PS3) Giuseppe Lavagetto: mediawiki: serve small private wikis with mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968)
[08:27:20] (PS1) Giuseppe Lavagetto: role::mediawiki::webserver: switch back mediawiki_test [puppet] - https://gerrit.wikimedia.org/r/450918
[08:27:22] (PS1) Giuseppe Lavagetto: mediawiki_test: remove mediawiki_exp module [puppet] - https://gerrit.wikimedia.org/r/450919
[08:28:02] !log reboot lvs-ulsfo 
primaries for kernel upgrade
[08:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:23] (CR) Gehel: [C: 2] elasticsearch: migrate codfw cluster to Stretch and RAID0 [puppet] - https://gerrit.wikimedia.org/r/450062 (https://phabricator.wikimedia.org/T193649) (owner: Gehel)
[08:28:54] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1019 (duration: 00m 49s)
[08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:05] !log start reimaging of elasticsearch / cirrus / codfw cluster (RAID0 / Stretch) - T193649 / T198391
[08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:11] T198391: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391
[08:29:12] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649
[08:29:58] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2001.codfw.wmnet'] ``` The log... 
[08:30:34] (PS1) Vgutierrez: standard: add dns100[12] to eqiad ntp peer list [puppet] - https://gerrit.wikimedia.org/r/450922 (https://phabricator.wikimedia.org/T196691)
[08:37:07] RECOVERY - Elasticsearch HTTPS on relforge1002 is OK: SSL OK - Certificate relforge1002.eqiad.wmnet valid until 2023-08-06 08:36:14 +0000 (expires in 1824 days)
[08:40:37] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2792 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:41:28] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2103 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:42:37] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2167 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:44:23] <_joe_> sigh
[08:49:47] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04058 https://grafana.wikimedia.org/dashboard/db/logstash
[08:49:57] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03488 https://grafana.wikimedia.org/dashboard/db/logstash
[08:50:20] !log reboot lvs-eqsin primaries for kernel upgrade
[08:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:48] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1202 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:54:39] Operations, ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (Peachey88)
[08:55:42] (CR) Jcrespo: "@_joe_, could you give me your thoughts on this vs. https://gerrit.wikimedia.org/r/345346 (or even a separate option alltoghether)." 
[puppet] - https://gerrit.wikimedia.org/r/449742 (https://phabricator.wikimedia.org/T156924) (owner: Jcrespo)
[08:57:15] <_joe_> jynus: I didn't forget about it :)
[08:57:28] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01094 https://grafana.wikimedia.org/dashboard/db/logstash
[08:57:38] <_joe_> and yeah that is the most thorny issue
[08:58:00] <_joe_> how to manage alerting, which is completely driven via puppet
[08:58:11] indeed
[08:58:26] I can set to warning dynamically
[08:58:37] but I cannot change the paging policy dynamically
[08:58:42] <_joe_> yes
[08:59:01] although alerting policy does not really need to be fully dynamic, unlike routing
[08:59:22] at least for what I want to do
[08:59:25] <_joe_> yes
[08:59:36] I think icinga should be more dynamic for other cases, though
[08:59:51] <_joe_> yeah, but icinga is *not* dynamic :/
[08:59:55] he he
[09:00:24] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 34.90, 32.11, 32.20
[09:00:30] <_joe_> what we could do, though, is to handle paging or not based on our internal logic in a script that icinga can run
[09:00:49] <_joe_> ouch, appservers overload coming back?
[09:00:54] <_joe_> sigh, here goes my day
[09:02:27] <_joe_> !log depool mw1226 for investigation
[09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:33] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:03] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:09] (CR) Arturo Borrero Gonzalez: [C: 1] "Thanks for handling this." 
[puppet] - https://gerrit.wikimedia.org/r/450638 (owner: Andrew Bogott)
[09:07:24] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time
[09:07:54] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.040 second response time
[09:07:59] !log reboot lvs3001 for kernel upgrade
[09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:24] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 31.95, 31.67, 32.01
[09:08:58] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2001.codfw.wmnet'] ``` and were **ALL** successful.
[09:10:03] <_joe_> !log restarting hhvm on mw1226, then repooling it
[09:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:36] (CR) Arturo Borrero Gonzalez: [C: 1] "I skipped the comment on purpose, since a phabricator reference is included in the commit message which can be inspected using git." 
[puppet] - https://gerrit.wikimedia.org/r/450610 (https://phabricator.wikimedia.org/T197176) (owner: BryanDavis)
[09:13:33] <_joe_> !log restarting hhvm on mw1231, high load
[09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:52] (PS1) Ema: wmf-upgrade-and-reboot: disable puppet before depool [puppet] - https://gerrit.wikimedia.org/r/450926
[09:15:14] <_joe_> !log rolling restart of HHVM in the api-eqiad cluster
[09:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:34] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.31, 12.40, 23.10
[09:21:13] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 6.88, 13.54, 23.02
[09:28:06] (PS2) Giuseppe Lavagetto: role::mediawiki::webserver: switch back mediawiki_test [puppet] - https://gerrit.wikimedia.org/r/450918
[09:28:56] (PS1) Gehel: elasticsearch: ensure that apt is refreshed before installing elasticsearch [puppet] - https://gerrit.wikimedia.org/r/450927 (https://phabricator.wikimedia.org/T193649)
[09:29:13] PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:13] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 74541 bytes in 0.977 second response time
[09:33:07] (PS1) Jcrespo: mariadb: Setup db1095 and db1102 as db backup sources for eqiad [puppet] - https://gerrit.wikimedia.org/r/450928 (https://phabricator.wikimedia.org/T201392)
[09:33:15] (CR) Vgutierrez: [C: 2] standard: add dns100[12] to eqiad ntp peer list [puppet] - https://gerrit.wikimedia.org/r/450922 (https://phabricator.wikimedia.org/T196691) (owner: Vgutierrez)
[09:33:24] (PS2) Vgutierrez: standard: add dns100[12] to eqiad ntp peer list [puppet] - https://gerrit.wikimedia.org/r/450922 (https://phabricator.wikimedia.org/T196691)
[09:33:27] <_joe_> uh? 
[09:33:35] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::webserver: switch back mediawiki_test [puppet] - https://gerrit.wikimedia.org/r/450918 (owner: Giuseppe Lavagetto)
[09:33:46] <_joe_> oh sorry
[09:33:56] <_joe_> I counter-merge-sniped you vgutierrez
[09:34:07] uh...
[09:34:26] go ahead and merge your stuff :)
[09:34:31] <_joe_> I already did
[09:34:44] <_joe_> that's why I said I counter sniped
[09:34:44] (PS3) Vgutierrez: standard: add dns100[12] to eqiad ntp peer list [puppet] - https://gerrit.wikimedia.org/r/450922 (https://phabricator.wikimedia.org/T196691)
[09:35:05] I ran puppet-merge and I saw your change there so.. :P
[09:35:16] <_joe_> it's merged
[09:36:10] mine as well.. everybody is happy :D
[09:38:20] (PS2) Giuseppe Lavagetto: mediawiki_test: remove mediawiki_exp module [puppet] - https://gerrit.wikimedia.org/r/450919
[09:38:23] (PS1) Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392)
[09:38:29] Operations, DNS, Traffic, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1001.wikimedia.org ``` The log can be found in `/v... 
[09:38:36] Operations, DNS, Traffic, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1001.wikimedia.org'] ``` Of which those **FAILED**: ``` ['dns1001.wikimedia.org'] ```
[09:38:55] Operations, DNS, Traffic, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1001.wikimedia.org ``` The log can be found in `/v...
[09:39:00] (CR) Giuseppe Lavagetto: [C: 2] mediawiki_test: remove mediawiki_exp module [puppet] - https://gerrit.wikimedia.org/r/450919 (owner: Giuseppe Lavagetto)
[09:40:14] Operations, DNS, Traffic, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1002.wikimedia.org ``` The log can be found in `/v...
[09:41:29] (PS1) Jcrespo: install-server: Allow reimage of db110X hosts [puppet] - https://gerrit.wikimedia.org/r/450930 (https://phabricator.wikimedia.org/T201392)
[09:43:07] !log reboot lvs-codfw primaries for kernel upgrade
[09:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:47] Operations, monitoring, Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286 (fgiunchedi) Resolved>Open Reopening, looks like tegmen is suffering lots of nsca processes again :( ``` root@tegmen:~# ps fwwwaux | grep -c nsca... 
[09:47:37] (PS2) Jcrespo: mariadb: Setup db1095 and db1102 as db backup sources for eqiad [puppet] - https://gerrit.wikimedia.org/r/450928 (https://phabricator.wikimedia.org/T201392)
[09:47:39] (PS2) Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392)
[09:47:41] (PS2) Jcrespo: install-server: Allow reimage of db1102 and db1095 database hosts [puppet] - https://gerrit.wikimedia.org/r/450930 (https://phabricator.wikimedia.org/T201392)
[09:47:53] (PS3) Jcrespo: install-server: Allow reimage of db1102 and db1095 database hosts [puppet] - https://gerrit.wikimedia.org/r/450930 (https://phabricator.wikimedia.org/T201392)
[09:49:13] (CR) Jcrespo: [C: 2] install-server: Allow reimage of db1102 and db1095 database hosts [puppet] - https://gerrit.wikimedia.org/r/450930 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:50:59] (PS1) Marostegui: db-eqiad.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/450931
[09:51:33] Operations, SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (fgiunchedi) p:Triage>Normal
[09:52:40] Operations, ops-eqiad, Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (fgiunchedi) p:Triage>Normal
[09:53:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/450931 (owner: Marostegui)
[09:53:52] (PS3) Jcrespo: mariadb: Setup db1095 and db1102 as db backup sources for eqiad [puppet] - https://gerrit.wikimedia.org/r/450928 (https://phabricator.wikimedia.org/T201392)
[09:54:48] (Merged) jenkins-bot: db-eqiad.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/450931 (owner: Marostegui)
[09:54:53] (CR) Jcrespo: [C: 2] mariadb: Setup db1095 and db1102 as db 
backup sources for eqiad [puppet] - https://gerrit.wikimedia.org/r/450928 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:55:04] PROBLEM - Host 2620:0:861:4:d294:66ff:fe5f:6e82 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:d294:66ff:fe5f:6e82)
[09:55:25] PROBLEM - Host 2620:0:861:1:d294:66ff:fe5f:5a1d is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:53] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2075 (duration: 00m 48s)
[09:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:43] ^^ those two it's me (dns1001/dns1002)
[09:57:14] PROBLEM - Host 2620:0:861:4:d294:66ff:fe5f:6e82 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:d294:66ff:fe5f:6e82)
[09:57:34] ACKNOWLEDGEMENT - Host 2620:0:861:1:d294:66ff:fe5f:5a1d is DOWN: PING CRITICAL - Packet loss = 100% Vgutierrez T196691
[09:57:44] !log Deploy schema change on db2075 - T67448 T114117 T5119
[09:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:52] T114117: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117
[09:57:55] T5119: apply html-tidy only once - https://phabricator.wikimedia.org/T5119
[09:57:57] T67448: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448
[09:58:34] ACKNOWLEDGEMENT - Host 2620:0:861:4:d294:66ff:fe5f:6e82 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:d294:66ff:fe5f:6e82) Vgutierrez T196691
[09:59:48] PROBLEM - Disk space on dns1002 is CRITICAL: Return code of 255 is out of bounds
[10:01:50] Operations, Traffic, Goal, Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (Johan)
[10:01:52] Operations, Traffic, User-Johan, User-notice: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371 (Johan) 
05Open>03Resolved [10:02:23] (03CR) 10Gehel: "PPC seems to agree: https://puppet-compiler.wmflabs.org/compiler02/12003/elastic2001.codfw.wmnet/ but does not check dependency cycles as " [puppet] - 10https://gerrit.wikimedia.org/r/450927 (https://phabricator.wikimedia.org/T193649) (owner: 10Gehel) [10:04:34] (03CR) 10Giuseppe Lavagetto: [C: 032] "I verified manually that the expansion is 1:1 for a specific virtual host here: https://puppet-compiler.wmflabs.org/compiler02/12004/mw126" [puppet] - 10https://gerrit.wikimedia.org/r/450585 (owner: 10Giuseppe Lavagetto) [10:04:37] PROBLEM - MD RAID on dns1002 is CRITICAL: Return code of 255 is out of bounds [10:04:38] PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:04:56] !log reboot einsteinium for kernel upgrade [10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] (03PS4) 10Giuseppe Lavagetto: mediawiki: makes includes explicit in private-https.conf [puppet] - 10https://gerrit.wikimedia.org/r/450585 [10:06:22] I hope the restart doesn't make it lose acks/downs [10:07:15] it shouldn't no, I've gracefully stopped icinga too [10:08:14] sadly, it made wmf-auto-reimage fail [10:08:24] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1001.wikimedia.org'] ``` and were **ALL** successful. [10:08:38] we're back [10:08:45] jynus: gah, at what stage? [10:09:16] last one, it should be ok [10:09:29] I made it to not notify, puppet will run later [10:09:44] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1002.wikimedia.org'] ``` and were **ALL** successful. 
[10:09:55] RECOVERY - MD RAID on dns1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:10:26] (03PS1) 10Marostegui: filtered_tables: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/450934 (https://phabricator.wikimedia.org/T51191) [10:11:05] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1235 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [10:12:41] (03CR) 10Volans: [C: 031] "LGTM, does it show any diff in the compiler? (pure curiosity)" [puppet] - 10https://gerrit.wikimedia.org/r/450927 (https://phabricator.wikimedia.org/T193649) (owner: 10Gehel) [10:14:45] (03PS2) 10Gehel: elasticsearch: ensure that apt is refreshed before installing elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/450927 (https://phabricator.wikimedia.org/T193649) [10:15:01] (03PS1) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [10:15:41] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.005703 https://grafana.wikimedia.org/dashboard/db/logstash [10:16:24] (03CR) 10Gehel: [C: 032] elasticsearch: ensure that apt is refreshed before installing elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/450927 (https://phabricator.wikimedia.org/T193649) (owner: 10Gehel) [10:18:18] !log reboot lvs-eqiad primaries for kernel upgrade [10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:14] !log shutting down es1019 for hw maintenance T201132 [10:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:19] T201132: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 [10:24:47] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10jcrespo) @Cmjohnson es1019 is fully depooled, alerts disabled and shutdown, please proceed directly with any 
task you need to do it, ping us here when finished. [10:26:51] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.123 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [10:27:02] RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS OK: 0.007 seconds response time. www.wikipedia.org returns 208.80.153.224 [10:29:11] RECOVERY - Disk space on dns1002 is OK: DISK OK [10:30:03] (03PS1) 10Jcrespo: mariadb: Depool db1122 and db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450939 (https://phabricator.wikimedia.org/T201392) [10:32:59] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [10:33:34] !log bounce logstash on logstash1007 for tests [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:28] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1122 and db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450939 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:37:30] (03PS1) 10Vgutierrez: conftool-data: Add dns100[12] to pdns_recursor service [puppet] - 10https://gerrit.wikimedia.org/r/450940 (https://phabricator.wikimedia.org/T196691) [10:38:05] <_joe_> jynus: I'll take a look at your patches in the afternoon [10:38:26] <_joe_> but I have no great idea for fixing the use of mw_primary there [10:38:35] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) [10:38:54] (03Merged) 10jenkins-bot: mariadb: Depool db1122 and db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450939 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:40:36] _joe_: we can always keep it (not kidding), and leave a monitoring of puppet == etcd [10:41:45] <_joe_> jynus: meh, I want to think about this problem once and for all [10:41:49] sure [10:42:17] (03PS4) 10Giuseppe Lavagetto: mediawiki: 
serve small private wikis with mediawiki::web::vhost [puppet] - 10https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968) [10:44:02] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 and db1081 (duration: 00m 49s) [10:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:55] (03CR) 10Filippo Giunchedi: trafficserver: initial module/profile/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [10:45:52] !log shutdown and upgrade of db1122 [10:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:56] (03PS2) 10Mobrovac: Remove base64 hack for binary values decoding. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446785 (owner: 10Ppchelko) [10:51:41] !log shutdown and upgrade of db1082 [10:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:55] !log shutdown and upgrade of db1081 (last message was wrong) [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1100). [11:00:05] tgr and mobrovac: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] ok jouncebot [11:00:45] tgr|away and mobrovac: both of you are deployers, want to deploy your own changes? 
[11:00:58] I can SWAT if you would prefer that [11:00:58] can do [11:01:08] tgr: go ahead then :D [11:02:10] idem [11:02:19] tgr: just ping me once you are done, please [11:02:25] (03PS2) 10Gergő Tisza: Remove hewiki interface-editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698) [11:02:38] (03CR) 10Gergő Tisza: [C: 032] Remove hewiki interface-editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698) (owner: 10Gergő Tisza) [11:04:10] (03Merged) 10jenkins-bot: Remove hewiki interface-editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698) (owner: 10Gergő Tisza) [11:07:53] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450442|Remove hewiki interface-editor group (T200698)]] (duration: 00m 49s) [11:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:58] T200698: Merge two hewiki user groups - https://phabricator.wikimedia.org/T200698 [11:08:47] mobrovac: I'm done [11:09:59] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.948 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [11:12:06] that's me ^ [11:12:10] kk thnx tgr [11:13:07] (03PS3) 10Mobrovac: Remove base64 hack for binary values decoding. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446785 (owner: 10Ppchelko) [11:15:00] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [11:15:21] (03CR) 10Mobrovac: [C: 032] Remove base64 hack for binary values decoding. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446785 (owner: 10Ppchelko) [11:16:38] (03Merged) 10jenkins-bot: Remove base64 hack for binary values decoding. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/446785 (owner: 10Ppchelko) [11:17:07] !log Remove unused repl grants for 10.64.0% [11:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:26] !log mobrovac@deploy1001 Synchronized rpc/RunSingleJob.php: RunSingleJob: remove unneeded base64 decoding (duration: 00m 49s) [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:31] (03CR) 10Volans: [C: 031] "LGTM looking at the PCC diffs (link below) and chatting with Giuseppe I agree it's equivalent to the current one, so a noop." [puppet] - 10https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:18:53] zeljkof: ok, i'm done too [11:18:57] i think that's it for this window [11:19:30] mobrovac: I don't see anything else in the calendar [11:19:33] !log EU SWAT finished [11:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:50] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:40] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 74515 bytes in 0.119 second response time [11:20:52] (03PS2) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [11:24:37] (03CR) 10Volans: "A full description with examples will be added into a documentation page once we agree on the API, feel free to ping me if you have any do" [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:55:37] !log Drop unused grants repl@208.80.155.117 [11:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:29] 10Operations, 10Wikidata: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC - https://phabricator.wikimedia.org/T198049 
(10tstarling) The drop may have been caused by the API maxlag parameter. [[https://www.wikidata.org/wiki/Wikidata:Bots|Wikidata:Bots]] recommends using a maxlag... [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1200) [12:04:13] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2002.codfw.wmnet'] ``` The log... [12:05:28] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: disable labtestnet2001 and replace it with labtestnet2003 [puppet] - 10https://gerrit.wikimedia.org/r/450959 (https://phabricator.wikimedia.org/T196752) [12:07:18] (03PS1) 10Joal: role::aqs: deploy new Druid config [puppet] - 10https://gerrit.wikimedia.org/r/450960 [12:14:58] (03CR) 10Volans: wmf-upgrade-and-reboot: disable puppet before depool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450926 (owner: 10Ema) [12:20:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: disable labtestnet2001 and replace it with labtestnet2003 [puppet] - 10https://gerrit.wikimedia.org/r/450959 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [12:24:54] (03PS1) 10BBlack: authdns1001: add to nameservers data set [puppet] - 10https://gerrit.wikimedia.org/r/450964 [12:25:27] (03CR) 10BBlack: [C: 032] authdns1001: add to nameservers data set [puppet] - 10https://gerrit.wikimedia.org/r/450964 (owner: 10BBlack) [12:26:59] (03PS1) 10Ema: wmf_auto_reimage_lib: update docstrings to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/450965 [12:30:33] (03CR) 10Volans: "Thanks a lot for fixing those, just a nitpick inline, looks good otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/450965 (owner: 10Ema) [12:31:34] PROBLEM - MediaWiki exceptions and 
fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:32:10] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2002.codfw.wmnet'] ``` and were **ALL** successful. [12:33:20] (03PS1) 10Lucas Werkmeister (WMDE): Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) [12:34:51] lots of [{exception_id}] {exception_url} Wikimedia\Timestamp\TimestampException from line 147 of /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/timestamp/src/ConvertibleTimestamp.php: Wikimedia\Timestamp\ConvertibleTimestamp::setTimestamp: Invalid timestamp [12:35:37] (03PS1) 10BBlack: authdns: pin gdnsd to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450967 [12:36:19] (03CR) 10jerkins-bot: [V: 04-1] authdns: pin gdnsd to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450967 (owner: 10BBlack) [12:37:13] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2003.codfw.wmnet', 'elastic2004... 
[12:37:19] (03CR) 10Vgutierrez: [C: 032] conftool-data: Add dns100[12] to pdns_recursor service [puppet] - 10https://gerrit.wikimedia.org/r/450940 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [12:37:29] (03PS2) 10Vgutierrez: conftool-data: Add dns100[12] to pdns_recursor service [puppet] - 10https://gerrit.wikimedia.org/r/450940 (https://phabricator.wikimedia.org/T196691) [12:37:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:40:32] error during compilation: Evaluation Error: Unknown function: 'os_version'. at /srv/workspace/puppet/modules/authdns/spec/fixtures/modules/authdns/manifests/init.pp:16:8 on node testhost.eqiad.wmnet [12:40:42] ^ I'm assuming this is CI fail rather than my patch? [12:40:59] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/450967 (owner: 10BBlack) [12:41:03] (03PS2) 10Ema: wmf_auto_reimage_lib: update docstrings to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/450965 [12:41:37] (03CR) 10jerkins-bot: [V: 04-1] authdns: pin gdnsd to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450967 (owner: 10BBlack) [12:41:42] (03CR) 10Ema: wmf_auto_reimage_lib: update docstrings to reflect reality (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/450965 (owner: 10Ema) [12:41:44] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org [12:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:07] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) p:05Triage>03Normal [12:42:34] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.6608 ge 0.1 
https://grafana.wikimedia.org/dashboard/db/logstash [12:42:34] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5306 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:44:36] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns1002.wikimedia.org [12:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:17] (03CR) 10BBlack: [V: 032 C: 032] "Assuming CI is wrong, will revert if not!" [puppet] - 10https://gerrit.wikimedia.org/r/450967 (owner: 10BBlack) [12:45:33] (03PS2) 10BBlack: authdns: pin gdnsd to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450967 [12:45:34] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.03716 https://grafana.wikimedia.org/dashboard/db/logstash [12:45:41] (03CR) 10BBlack: [V: 032 C: 032] authdns: pin gdnsd to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450967 (owner: 10BBlack) [12:47:34] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1137 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:48:48] (03PS3) 10Ema: wmf_auto_reimage_lib: update docstrings to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/450965 [12:51:34] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2018 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:53:24] (03PS1) 10Filippo Giunchedi: logstash: enable persistent queues [puppet] - 10https://gerrit.wikimedia.org/r/450971 (https://phabricator.wikimedia.org/T200960) [12:53:35] PROBLEM - Host authdns1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:25] RECOVERY - Host authdns1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [12:55:56] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/450965 (owner: 10Ema) [12:56:50] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/12006/logstash1007.eqiad.wmnet/" 
[puppet] - 10https://gerrit.wikimedia.org/r/450971 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [12:56:52] (03CR) 10Ema: [C: 032] wmf_auto_reimage_lib: update docstrings to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/450965 (owner: 10Ema) [12:58:43] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1631 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:58:46] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=chromium.wikimedia.org [12:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org [12:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:17] (03CR) 10Filippo Giunchedi: [C: 032] logstash: enable persistent queues [puppet] - 10https://gerrit.wikimedia.org/r/450971 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [12:59:25] (03PS2) 10Filippo Giunchedi: logstash: enable persistent queues [puppet] - 10https://gerrit.wikimedia.org/r/450971 (https://phabricator.wikimedia.org/T200960) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1300) [13:02:53] RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [13:03:13] ^ interesting [13:04:55] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2004.codfw.wmnet', 'elastic2005.codfw.wmnet', 'elastic2003.codfw.wmnet'] ``` an... 
[13:05:32] (03PS2) 10Ottomata: role::aqs: deploy new Druid config [puppet] - 10https://gerrit.wikimedia.org/r/450960 (owner: 10Joal) [13:05:44] (03CR) 10Ottomata: [V: 032 C: 032] role::aqs: deploy new Druid config [puppet] - 10https://gerrit.wikimedia.org/r/450960 (owner: 10Joal) [13:06:44] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1122 and db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450974 [13:07:03] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [13:07:05] 10Operations, 10ops-eqiad, 10DBA: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Marostegui) 05stalled>03Resolved It recovered itself: ``` ˜/icinga-wm 15:02> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashb... [13:07:22] (03PS1) 10Filippo Giunchedi: logstash: brown paperbag fix [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) [13:07:44] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04513 https://grafana.wikimedia.org/dashboard/db/logstash [13:07:44] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01544 https://grafana.wikimedia.org/dashboard/db/logstash [13:08:08] (03CR) 10jerkins-bot: [V: 04-1] logstash: brown paperbag fix [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:10:13] (03PS2) 10Filippo Giunchedi: logstash: brown paperbag fix [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) [13:10:20] (03PS2) 10Ema: wmf-upgrade-and-reboot: disable puppet before depool [puppet] - 10https://gerrit.wikimedia.org/r/450926 [13:10:26] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 (10Krinkle) 
[13:10:55] !log otto@deploy1001 Started restart [analytics/aqs/deploy@6fafc63]: Bouncing AQS for mediawiki_history index update [13:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:57] (03PS1) 10Vgutierrez: smokeping: Replace hydrogen & chromium with dns1001 and dns1002 [puppet] - 10https://gerrit.wikimedia.org/r/450977 (https://phabricator.wikimedia.org/T196691) [13:12:10] (03CR) 10Filippo Giunchedi: [C: 032] logstash: brown paperbag fix [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:12:18] (03PS3) 10Filippo Giunchedi: logstash: brown paperbag fix [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) [13:12:29] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/12008/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/450976 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:13:21] (03CR) 10Volans: [C: 031] "LGTM, thanks for adding this to the library!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450926 (owner: 10Ema) [13:15:45] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [13:16:00] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) Added see also: {T193050} [13:16:42] 10Operations, 10Traffic, 10netops: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10Vgutierrez) p:05Triage>03Normal [13:20:13] (03PS3) 10Ema: wmf-upgrade-and-reboot: disable puppet before depool [puppet] - 10https://gerrit.wikimedia.org/r/450926 [13:21:17] (03CR) 10Ema: [C: 032] wmf-upgrade-and-reboot: disable puppet before depool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450926 (owner: 10Ema) [13:23:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:27:06] known ^ [13:28:58] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) I've enabled disk persisted queues in logstash, early to tell now but it looks like that "fixed" (papered over) the issue so slow pipelines and output... 
[13:32:00] (03PS1) 10Vgutierrez: lvs: use the new dns100[12] recursive DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/450982 (https://phabricator.wikimedia.org/T196691) [13:32:39] (03CR) 10Ayounsi: [C: 031] smokeping: Replace hydrogen & chromium with dns1001 and dns1002 [puppet] - 10https://gerrit.wikimedia.org/r/450977 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [13:33:11] (03CR) 10Vgutierrez: [C: 032] smokeping: Replace hydrogen & chromium with dns1001 and dns1002 [puppet] - 10https://gerrit.wikimedia.org/r/450977 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [13:33:19] (03PS2) 10Vgutierrez: smokeping: Replace hydrogen & chromium with dns1001 and dns1002 [puppet] - 10https://gerrit.wikimedia.org/r/450977 (https://phabricator.wikimedia.org/T196691) [13:34:20] XioNoX: thx :) [13:35:35] (03CR) 10Vgutierrez: [C: 032] lvs: use the new dns100[12] recursive DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/450982 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [13:35:43] (03PS2) 10Vgutierrez: lvs: use the new dns100[12] recursive DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/450982 (https://phabricator.wikimedia.org/T196691) [13:35:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:36:09] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=nginx [13:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:23] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1071.eqiad.wmnet,service=nginx [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:10] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1081.eqiad.wmnet,service=nginx [13:37:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:24] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1055.eqiad.wmnet,service=nginx [13:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:02] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1083.eqiad.wmnet [13:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:15] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1065.eqiad.wmnet [13:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1084.eqiad.wmnet [13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:58] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1072.eqiad.wmnet [13:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:39] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=varnish-fe [13:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:54] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1071.eqiad.wmnet,service=varnish-fe [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:07] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1081.eqiad.wmnet,service=varnish-fe [13:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:17] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1055.eqiad.wmnet,service=varnish-fe [13:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:34] (03PS1) 10Vgutierrez: hieradata: Get rid of hydrogen and chromium references [puppet] - 10https://gerrit.wikimedia.org/r/450984 [13:43:00] (03PS2) 
10Vgutierrez: hieradata: Get rid of hydrogen and chromium references [puppet] - 10https://gerrit.wikimedia.org/r/450984 (https://phabricator.wikimedia.org/T196691) [13:50:58] (03PS1) 10Vgutierrez: standard: Remove chromium and hydrogen from ntp peer list [puppet] - 10https://gerrit.wikimedia.org/r/450986 (https://phabricator.wikimedia.org/T196691) [13:51:43] (03CR) 10Vgutierrez: [C: 032] hieradata: Get rid of hydrogen and chromium references [puppet] - 10https://gerrit.wikimedia.org/r/450984 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [13:51:46] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:52:56] (03PS3) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [13:53:18] (03PS1) 10Volans: Fix docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) [13:53:39] (03CR) 10jenkins-bot: Revert "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450879 (owner: 10Jforrester) [13:53:41] (03CR) 10jenkins-bot: Use mcrouter for cache reads on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449604 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [13:53:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450901 (owner: 10Marostegui) [13:53:45] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450907 (owner: 10Marostegui) [13:53:47] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450908 (owner: 10Marostegui) [13:53:49] (03CR) 10jenkins-bot: mariadb: Depool es1019 
for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450912 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [13:53:51] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450931 (owner: 10Marostegui) [13:53:53] (03CR) 10jenkins-bot: mariadb: Depool db1122 and db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450939 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [13:53:55] (03CR) 10jenkins-bot: Remove hewiki interface-editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698) (owner: 10Gergő Tisza) [13:53:57] (03CR) 10jenkins-bot: Remove base64 hack for binary values decoding. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446785 (owner: 10Ppchelko) [13:57:32] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Dzahn) I was able to reinstall it and get it up again .. but the memory error should still be checked. [14:01:28] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Cmjohnson) @Dzahn Is this okay to take down? I will need to do a couple of things before I can create a replacement ticket with Dell. 
[14:01:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:02:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:07:33] 10Operations, 10ops-eqiad, 10DBA: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10fgiunchedi) Indeed it can happen since the alert is errors over four days, if no new errors come in the alert will recover [14:07:40] (03PS1) 10Jcrespo: mariadb: Repool es1019 and db1102 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450990 (https://phabricator.wikimedia.org/T201132) [14:11:17] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=varnish-be [14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:23] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1071.eqiad.wmnet,service=varnish-be [14:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1081.eqiad.wmnet,service=varnish-be [14:11:32] (03CR) 10jerkins-bot: [V: 04-1] Fix docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:38] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1055.eqiad.wmnet,service=varnish-be [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] (03PS1) 10Herron: logstash: double jvm heap size 
to 1g [puppet] - 10https://gerrit.wikimedia.org/r/450991 (https://phabricator.wikimedia.org/T200960) [14:11:43] (03PS2) 10Jcrespo: mariadb: Repool es1019 and db1102 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450990 (https://phabricator.wikimedia.org/T201132) [14:12:49] (03CR) 10Herron: [C: 032] logstash: double jvm heap size to 1g [puppet] - 10https://gerrit.wikimedia.org/r/450991 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [14:13:34] (03PS2) 10Volans: Fix docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) [14:13:41] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1019 and db1102 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450990 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:14:00] (03PS1) 10Filippo Giunchedi: logstash: don't restart daily [puppet] - 10https://gerrit.wikimedia.org/r/450992 (https://phabricator.wikimedia.org/T200960) [14:15:11] (03Merged) 10jenkins-bot: mariadb: Repool es1019 and db1102 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450990 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:16:45] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp2002 is CRITICAL: connect to address 10.192.0.123 and port 3128: Connection refused [14:17:22] (03CR) 10Hoo man: [C: 031] "Please note that this won't affect the current dumps as Lexemes are generally excluded there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [14:18:10] cp2002 just rebooted, looking ^ [14:18:22] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019 and db1102 with low load after maintenance (duration: 00m 52s) [14:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:41] !log double again logstash jvm heap size to 1g and rolling restart logstash instances [14:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:45] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.072 second response time [14:19:51] (03PS1) 10Jcrespo: mariadb: Repool db1081 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450994 (https://phabricator.wikimedia.org/T201132) [14:22:34] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1081 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450994 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:23:53] (03Merged) 10jenkins-bot: mariadb: Repool db1081 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450994 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:24:10] (03PS3) 10Andrew Bogott: site.pp: remove def for labvirt1021 and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/450638 [14:26:11] (03CR) 10jenkins-bot: mariadb: Repool es1019 and db1102 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450990 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:26:13] (03CR) 10jenkins-bot: mariadb: Repool db1081 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450994 (https://phabricator.wikimedia.org/T201132) (owner: 10Jcrespo) [14:26:32] !log jynus@deploy1001 Synchronized 
wmf-config/db-eqiad.php: Repool db1081 with low load (duration: 00m 47s) [14:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:17] (03CR) 10Andrew Bogott: [C: 032] site.pp: remove def for labvirt1021 and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/450638 (owner: 10Andrew Bogott) [14:28:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:31:39] (03PS2) 10Jcrespo: mariadb: Repool es1019, db1122 and db1081 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450974 [14:34:30] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10BBlack) [14:42:31] (03PS12) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [14:43:18] (03CR) 10jerkins-bot: [V: 04-1] WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 (owner: 10Bstorm) [14:46:56] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (10jcrespo) 05Open>03Resolved Everything looking ok now. 
[14:48:07] (03PS13) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [14:48:45] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (031 comment) [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/450637 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [14:52:57] (03CR) 10Vgutierrez: [C: 032] standard: Remove chromium and hydrogen from ntp peer list [puppet] - 10https://gerrit.wikimedia.org/r/450986 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [14:53:05] (03PS2) 10Vgutierrez: standard: Remove chromium and hydrogen from ntp peer list [puppet] - 10https://gerrit.wikimedia.org/r/450986 (https://phabricator.wikimedia.org/T196691) [14:53:07] (03CR) 10jerkins-bot: [V: 04-1] WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 (owner: 10Bstorm) [14:55:27] (03PS2) 10Herron: initial import of prometheus-logstash-exporter-0.1.2 [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/450637 (https://phabricator.wikimedia.org/T200362) [14:56:22] (03CR) 10Herron: [V: 032 C: 032] initial import of prometheus-logstash-exporter-0.1.2 (031 comment) [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/450637 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [14:57:21] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1078.eqiad.wmnet [14:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:30] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1064.eqiad.wmnet [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:56] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1079.eqiad.wmnet [14:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:06] !log bblack@neodymium conftool action : set/pooled=no; selector: 
name=cp1054.eqiad.wmnet [14:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:01:53] (03PS1) 10Vgutierrez: conftool-data: Remove chromium & hydrogen from pdns_recursor service [puppet] - 10https://gerrit.wikimedia.org/r/451002 (https://phabricator.wikimedia.org/T196691) [15:02:06] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Imarlier) Possibly related -- this should likely be implemented separately, but there's a slight chance that there's... [15:02:18] (03PS14) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [15:05:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:06:34] (03CR) 10Vgutierrez: [C: 032] conftool-data: Remove chromium & hydrogen from pdns_recursor service [puppet] - 10https://gerrit.wikimedia.org/r/451002 (https://phabricator.wikimedia.org/T196691) (owner: 10Vgutierrez) [15:09:50] 10Operations, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Imarlier) FWIW, I would prefer general access instead of having to ask someone to move files for me. There are a number of open Phab tickets that request sitemap generation for different wikis, includin... 
[15:15:05] (03PS1) 10PleaseStand: Don't use hex escapes for non-ASCII characters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451004 [15:15:36] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:16:23] ^^ that's me [15:19:09] (03PS1) 10BBlack: geo-maps: a few basic cleanups, no major impact [dns] - 10https://gerrit.wikimedia.org/r/451007 [15:19:18] ACKNOWLEDGEMENT - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Vgutierrez to be decommed [15:19:47] (03PS2) 10BBlack: geo-maps: a few basic cleanups, no major impact [dns] - 10https://gerrit.wikimedia.org/r/451007 [15:21:35] (03PS3) 10BBlack: geo-maps: a few basic cleanups, no major impact [dns] - 10https://gerrit.wikimedia.org/r/451007 [15:22:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:23:22] (03CR) 10BBlack: [C: 032] geo-maps: a few basic cleanups, no major impact [dns] - 10https://gerrit.wikimedia.org/r/451007 (owner: 10BBlack) [15:26:07] 10Operations, 10Maps: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter - https://phabricator.wikimedia.org/T150460 (10Gehel) First step here is to investigate how those metrics are published from application code and see if there is a config flag already in place to s... [15:26:15] 10Operations, 10Maps: publish kartotherian / tilerator metrics by cluster - https://phabricator.wikimedia.org/T150466 (10Jhernandez) This involves: * Looking into how the applications send data * Seeing if there is config there already to send Look at https://grafana.wikimedia.org/dashboard/db/maps-performan... 
[15:26:24] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1077.eqiad.wmnet [15:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:36] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1053.eqiad.wmnet [15:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:45] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog: publish kartotherian / tilerator metrics by cluster - https://phabricator.wikimedia.org/T150466 (10Jhernandez) [15:30:25] 10Operations, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10jcrespo) > I don't know if we will actually end up automating these runs going forward, but having the ability to do so if we find that sitemaps improve our search engine indexing would be extremely help... [15:33:33] (03PS5) 10Giuseppe Lavagetto: mediawiki: serve small private wikis with mediawiki::web::vhost [puppet] - 10https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968) [15:34:21] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Dzahn) @Cmjohnson I have said on the list it should not be used currently and then brought it back up anyways. So yes, it is ok to bring it down to check/fix the memory issue. I w... 
[15:34:30] 10Operations, 10Maps: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter - https://phabricator.wikimedia.org/T150460 (10Jhernandez) a:03Gehel [15:35:51] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: serve small private wikis with mediawiki::web::vhost [puppet] - 10https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:36:58] 10Operations, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Imarlier) @jcrespo Right, obviously this would end up in puppet if it were something that we were going to do as more than a one-off. But even when putting something in to puppet, not being able to see... [15:41:28] (03CR) 10Jforrester: [C: 031] Don't use hex escapes for non-ASCII characters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451004 (owner: 10PleaseStand) [15:43:16] (03PS3) 10Thcipriani: Scap: update logstash_checker.py mwdeploy query [puppet] - 10https://gerrit.wikimedia.org/r/449639 [15:51:01] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1019, db1122 and db1081 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450974 (owner: 10Jcrespo) [15:52:06] (03CR) 10Filippo Giunchedi: [C: 032] logstash: don't restart daily [puppet] - 10https://gerrit.wikimedia.org/r/450992 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [15:52:14] (03PS2) 10Filippo Giunchedi: logstash: don't restart daily [puppet] - 10https://gerrit.wikimedia.org/r/450992 (https://phabricator.wikimedia.org/T200960) [15:52:33] (03Merged) 10jenkins-bot: mariadb: Repool es1019, db1122 and db1081 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450974 (owner: 10Jcrespo) [15:55:15] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019, db1122 and db1081 with full weight (duration: 00m 51s) [15:55:17] PROBLEM - MediaWiki exceptions and fatals per minute on 
graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] (03CR) 10Imarlier: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/450468 (owner: 10Ori.livneh) [15:59:05] (03CR) 10jenkins-bot: mariadb: Repool es1019, db1122 and db1081 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450974 (owner: 10Jcrespo) [16:00:04] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:00:40] cmjohnson1: i have scheduled downtime in icinga for bast1002. 
you can go ahead there [16:00:57] (for the next 5 hours or so) [16:02:16] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:05:26] (03PS2) 10Dzahn: Declare and manage a /var/cache/coal_web dir [puppet] - 10https://gerrit.wikimedia.org/r/450468 (owner: 10Ori.livneh) [16:12:04] (03PS3) 10Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) [16:12:08] (03PS1) 10Jcrespo: prometheus_mysqld_exporter: Remove cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/451014 [16:12:14] (03CR) 10Dzahn: [C: 032] Declare and manage a /var/cache/coal_web dir [puppet] - 10https://gerrit.wikimedia.org/r/450468 (owner: 10Ori.livneh) [16:12:27] (03PS2) 10Jcrespo: prometheus_mysqld_exporter: Remove cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/451014 [16:13:42] (03CR) 10Jcrespo: [C: 032] prometheus_mysqld_exporter: Remove cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/451014 (owner: 10Jcrespo) [16:13:53] (03CR) 10Dzahn: [C: 032] "/etc/tmpfiles.d/coal-web.conf has been created on webperf*" [puppet] - 10https://gerrit.wikimedia.org/r/450468 (owner: 10Ori.livneh) [16:14:34] (03CR) 10Andrew Bogott: "There is still mysql on cloudcontrol1003 but we're hoping to move away from it next week." [puppet] - 10https://gerrit.wikimedia.org/r/451014 (owner: 10Jcrespo) [16:15:20] (03CR) 10Krinkle: [C: 031] "I can see this url in Google cache, but for myself, anything on that domain seems to time out?" 
[puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [16:15:23] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5011.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [16:17:27] PROBLEM - Host bast1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:59] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, 10Epic: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889 (10Gehel) [16:18:03] 10Operations, 10Maps: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter - https://phabricator.wikimedia.org/T150460 (10Gehel) 05Open>03declined This is actually managed at the service module level and there is a note that statsd is actually eqiad only at this point... [16:19:28] (03PS1) 10Filippo Giunchedi: WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 [16:20:05] (03CR) 10jerkins-bot: [V: 04-1] WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 (owner: 10Filippo Giunchedi) [16:20:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2006.codfw.wmnet', 'elastic2025... 
[16:21:22] (03CR) 10Jcrespo: [C: 032] "This is production monitoring, and it wasn't working- even if it stays, we can re-add it again, but the problem was the version installed " [puppet] - 10https://gerrit.wikimedia.org/r/451014 (owner: 10Jcrespo) [16:22:23] (03PS2) 10Filippo Giunchedi: WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 [16:22:37] RECOVERY - Host bast1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [16:22:55] (03CR) 10jerkins-bot: [V: 04-1] WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 (owner: 10Filippo Giunchedi) [16:23:45] (03CR) 10Andrew Bogott: "no problem! Just wanted to make sure you know what's happening :)" [puppet] - 10https://gerrit.wikimedia.org/r/451014 (owner: 10Jcrespo) [16:24:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:25:58] (03PS3) 10Filippo Giunchedi: WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 [16:26:18] Eh, those 5xx spikes do not look good. [16:26:23] might be me [16:26:27] no idea why but might be [16:26:31] (03CR) 10jerkins-bot: [V: 04-1] WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 (owner: 10Filippo Giunchedi) [16:26:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:26:57] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10jcrespo) I think es1014 issue gone away (according to grafana)? 
[16:27:04] it is from wikiquote fwiw, likely related to the mw exceptions seen earlier [16:27:19] (03PS1) 10RobH: cloudvirt102[34] install params [puppet] - 10https://gerrit.wikimedia.org/r/451019 (https://phabricator.wikimedia.org/T199125) [16:27:50] Thanks, so the fatals are from intentional scanning for ways to produce fatals. Thanks, np. [16:28:09] ?? [16:28:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:28:58] Ah ok, I think I see now [16:29:02] (03CR) 10RobH: [C: 032] cloudvirt102[34] install params [puppet] - 10https://gerrit.wikimedia.org/r/451019 (https://phabricator.wikimedia.org/T199125) (owner: 10RobH) [16:29:48] Krinkle: Is there a ticket to refer to? [16:29:56] hoo: private - https://phabricator.wikimedia.org/T201411 [16:30:48] I've cc-ed you [16:33:46] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Still no good for me (at least between prometheus1004 and es1014). Provided all the requested info to Juniper and their answer so far is "bounce the port" which solved the... [16:34:21] so, Krinkle what I mean is that our actionables were to check it was not successful/code checked was not a known vulnerability [16:34:57] I would NEVER test vulns without announcing it first [16:35:39] jynus: The one about timestamps/dates is not a vulnerability. In fact, the fatal comes from code checking originally written for checking this at the database level, so all is good there. It's bad input, with expected error, but not caught, so should be 200 with user error or some other not-5xx, but outcome will be the same effectively. 
[16:36:04] cool, feel free to make it public- I was just being cautious [16:36:07] Thanks [16:36:08] :-) [16:36:09] will do [16:36:19] especially when it is not my expertise [16:36:29] I am going for reaz now, bye! [16:37:56] (03CR) 10Jcrespo: [C: 032] "Actually my fault for not explaining the patch properly. Removing the monitoring on prometheus100X, not touching cloudcontrol1003." [puppet] - 10https://gerrit.wikimedia.org/r/451014 (owner: 10Jcrespo) [16:40:19] (03PS4) 10Filippo Giunchedi: WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 [16:41:56] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@65e6bb9]: bump to master, prep for deploy to cirrus servers [16:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:09] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Cmjohnson) - Swapped the DIMM in A1 to B1 to see if the error follows the DIMM, goes away or stays with the CPU. - While it was down, I moved to a non-10G rack, rack c6 and updat... [16:42:16] (03CR) 10jerkins-bot: [V: 04-1] WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 (owner: 10Filippo Giunchedi) [16:44:51] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@65e6bb9]: bump to master, prep for deploy to cirrus servers (duration: 02m 54s) [16:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:23] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2026.codfw.wmnet', 'elastic2025.codfw.wmnet', 'elastic2006.codfw.wmnet'] ``` an... 
[16:47:17] (03PS5) 10Filippo Giunchedi: WIP: add jmx_exporter to logstash [puppet] - 10https://gerrit.wikimedia.org/r/451018 [16:47:46] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet [16:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:56] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1052.eqiad.wmnet [16:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:33] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@22a50af]: bump to master, prep for deploy to cirrus servers, take two [16:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:47] (03PS6) 10Filippo Giunchedi: logstash: add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362) [16:50:25] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@22a50af]: bump to master, prep for deploy to cirrus servers, take two (duration: 01m 51s) [16:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:18] !log Deployed patch for T201418 [16:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) So, I've gone ahead and updated the puppet repo for the installation, and they successfully PXE boot into the jessie install... 
[16:52:19] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) [16:53:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:53:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:56:46] 10Operations, 10Data-Services, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10bd808) @Bstorm can you work with @Imarlier to get him access to labstore1006/7 (this may require a new user security role for these hosts) and show him where to put things so that the... [16:58:52] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407 (10Andrew) a:05RobH>03Andrew Reassigning to myself to understand what's happening here. [16:59:14] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407 (10RobH) I'm not exactly sure what is being done here? It seems that labnodepool1002 no longer needs to serve in that role, and will be assigned a new hostnam... [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1700). [17:00:18] no parsoid deploy today [17:00:40] ORES won’t be deployed today. 
[17:00:51] PROBLEM - Elasticsearch HTTPS on elastic2026 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2026.codfw.wmnet [17:01:30] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5011.eqsin.wmnet'] ``` and were **ALL** successful. [17:01:42] (03PS1) 10Cmjohnson: removing mgmt dns for decom db hosts [dns] - 10https://gerrit.wikimedia.org/r/451027 (https://phabricator.wikimedia.org/T195484) [17:02:29] (03PS2) 10Cmjohnson: removing mgmt dns for decom db hosts [dns] - 10https://gerrit.wikimedia.org/r/451027 (https://phabricator.wikimedia.org/T195484) [17:02:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/12013/" [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [17:02:52] (03CR) 10Cmjohnson: [C: 032] removing mgmt dns for decom db hosts [dns] - 10https://gerrit.wikimedia.org/r/451027 (https://phabricator.wikimedia.org/T195484) (owner: 10Cmjohnson) [17:04:03] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10Cmjohnson) [17:04:14] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10Cmjohnson) 05Open>03Resolved [17:04:41] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1053 - https://phabricator.wikimedia.org/T194634 (10Cmjohnson) [17:04:56] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1053 - https://phabricator.wikimedia.org/T194634 (10Cmjohnson) 05Open>03Resolved [17:05:37] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1054 - https://phabricator.wikimedia.org/T197063 (10Cmjohnson) [17:05:56] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1054 - 
https://phabricator.wikimedia.org/T197063 (10Cmjohnson) 05Open>03Resolved [17:06:12] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1055 - https://phabricator.wikimedia.org/T194118 (10Cmjohnson) [17:06:24] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1055 - https://phabricator.wikimedia.org/T194118 (10Cmjohnson) 05Open>03Resolved [17:06:46] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736 (10Cmjohnson) [17:06:51] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736 (10Cmjohnson) 05Open>03Resolved [17:07:11] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5012.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [17:07:18] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1059 - https://phabricator.wikimedia.org/T196606 (10Cmjohnson) [17:07:25] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1059 - https://phabricator.wikimedia.org/T196606 (10Cmjohnson) 05Open>03Resolved [17:07:45] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1060 - https://phabricator.wikimedia.org/T193732 (10Cmjohnson) [17:08:01] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1060 - https://phabricator.wikimedia.org/T193732 (10Cmjohnson) 05Open>03Resolved [17:09:00] RECOVERY - Elasticsearch HTTPS on elastic2026 is OK: SSL OK - Certificate elastic2026.codfw.wmnet valid until 2023-08-06 17:07:48 +0000 (expires in 1824 days) [17:10:00] (03PS1) 10Cmjohnson: Removing mgmt dns for eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/451032 (https://phabricator.wikimedia.org/T189566) [17:10:45] (03PS1) 10RobH: adding torrelay1001 ipv6 entries [dns] - 
10https://gerrit.wikimedia.org/r/451033 (https://phabricator.wikimedia.org/T196701) [17:11:24] (03PS2) 10RobH: adding torrelay1001 ipv6 entries [dns] - 10https://gerrit.wikimedia.org/r/451033 (https://phabricator.wikimedia.org/T196701) [17:11:41] (03CR) 10RobH: [C: 032] adding torrelay1001 ipv6 entries [dns] - 10https://gerrit.wikimedia.org/r/451033 (https://phabricator.wikimedia.org/T196701) (owner: 10RobH) [17:12:25] (03PS2) 10Cmjohnson: Removing mgmt dns for eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/451032 (https://phabricator.wikimedia.org/T189566) [17:13:19] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/451032 (https://phabricator.wikimedia.org/T189566) (owner: 10Cmjohnson) [17:14:55] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566 (10Cmjohnson) [17:15:03] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566 (10Cmjohnson) 05Open>03Resolved [17:15:08] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10Halfak) [17:15:10] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [17:16:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:16:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:17:55] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) 05Open>03Resolved [17:18:17] (03PS1) 10RobH: torrelay1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/451044 (https://phabricator.wikimedia.org/T196701) [17:18:32] 10Operations, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (10Gehel) [17:19:01] 10Operations, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (10EBernhardson) a:03Gehel [17:20:41] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:13] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) [17:21:30] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Rename labnodepool1002.eqiad.wmnet as cloudservices1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201439 (10Andrew) p:05Triage>03Normal [17:21:52] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10Andrew) [17:22:01] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5012_v4, cp5012_v6 [17:22:15] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10Andrew) Thanks to T201439 I've just reduced the ask on this ticket from 2 servers to 1: cloudservices1004. 
[17:26:36] 10Operations, 10hardware-requests: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) [17:27:02] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) [17:27:06] 10Operations, 10cloud-services-team, 10hardware-requests: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) [17:28:37] 10Operations, 10cloud-services-team, 10decommission, 10hardware-requests: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) [17:28:42] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10RobH) [17:33:45] (03PS1) 10Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/451050 (https://phabricator.wikimedia.org/T201440) [17:35:57] (03PS1) 10Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/451051 (https://phabricator.wikimedia.org/T201440) [17:38:05] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10Cmjohnson) [17:38:21] 10Operations, 10cloud-services-team, 10decommission, 10hardware-requests, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) [17:39:40] 10Operations, 10cloud-services-team, 10decommission, 10hardware-requests, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) [17:42:52] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201439 (10RobH) [17:47:41] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 68 ESP OK [17:49:21] 10Operations, 10Traffic, 
10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5012.eqsin.wmnet'] ``` Of which those **FAILED**: ``` ['cp5012.eqsin.wmnet'] ``` [17:52:27] (03PS1) 10Cmjohnson: Removing mgmt dns mw1201-1220 [dns] - 10https://gerrit.wikimedia.org/r/451057 (https://phabricator.wikimedia.org/T185004) [17:53:35] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns mw1201-1220 [dns] - 10https://gerrit.wikimedia.org/r/451057 (https://phabricator.wikimedia.org/T185004) (owner: 10Cmjohnson) [17:54:07] 10Operations, 10DC-Ops, 10cloud-services-team, 10netops: Refresh switch ports descriptions for recently renamed cloud servers - https://phabricator.wikimedia.org/T201444 (10RobH) [17:55:15] (03CR) 10RobH: [C: 032] torrelay1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/451044 (https://phabricator.wikimedia.org/T196701) (owner: 10RobH) [18:01:05] 10Operations, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10thcipriani) >>! In T200690#4482670, @Tgr wrote: >> I wonder if there are git hooks we could setup on the deployment servers to address this without having to put any logic into scap? > > Sure (just chec... [18:02:21] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet [18:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:46] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1063.eqiad.wmnet [18:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:31] (03CR) 10Herron: [C: 031] "Looks good! 
one minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [18:13:17] !log netbox - temp dropped database to test restoring from dump [18:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:49] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [18:24:08] !log netbox - restored database from dump file - backed up and back-up (T190184) [18:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:15] T190184: Netbox: setup backups - https://phabricator.wikimedia.org/T190184 [18:25:18] 10Operations, 10ops-eqiad: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10RobH) [18:26:15] 10Operations, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10RobH) a:05RobH>03Dzahn IRC Sync/Update: I've chatted with @dzahn via irc and he is expecting this task reassignment. He'll be handling pushing this into service, and filing a #decom task for... [18:34:24] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Nemo_bis) > To me it still seems the easiest solution would be to put this on a separate wiki. This was... [18:42:58] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (10Jdforrester-WMF) Is this now Resolved? 
[18:44:57] 10Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (10Jdforrester-WMF) [18:45:03] 10Operations, 10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199 (10Jdforrester-WMF) [18:45:11] 10Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (10Jdforrester-WMF) [18:45:16] 10Operations, 10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199 (10Jdforrester-WMF) [18:45:47] 10Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (10Jdforrester-WMF) >>! In T147967#2710596, @BBlack wrote: > I'd suggest blocking this on the seemingly-unrelated T1471... [18:54:47] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) a:03awight [18:55:02] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) a:03awight [18:58:07] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) @Nemo_bis, thanks for chiming in. There are a lot of concerns I have about a central wiki from a... 
[18:59:08] (03PS1) 10Gergő Tisza: Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T190015) [18:59:53] (03PS2) 10Herron: prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) [19:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T1900). [19:02:39] (03CR) 10Jforrester: Enable TemplateStyles everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [19:02:47] (03PS2) 10Gergő Tisza: Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T199909) [19:03:13] * James_F grins at tgr|away. 
[19:03:44] !log Branching 1.32.0-wmf.16 [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:56] I'm sure the linux dual clipboard thing was somebody's epic revenge on humanity [19:06:56] !log otto@deploy1001 Started deploy [eventstreams/deploy@07033d4]: Deploying eventstreams with timestamp in Last-Event-ID (scb2001 only) [19:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:15] !log otto@deploy1001 Finished deploy [eventstreams/deploy@07033d4]: Deploying eventstreams with timestamp in Last-Event-ID (scb2001 only) (duration: 00m 20s) [19:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:59] !log otto@deploy1001 Started deploy [eventstreams/deploy@07033d4]: Deploying eventstreams with timestamp in Last-Event-ID (all nodes) [19:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:18] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/12015/" [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [19:09:50] !log otto@deploy1001 Finished deploy [eventstreams/deploy@07033d4]: Deploying eventstreams with timestamp in Last-Event-ID (all nodes) (duration: 01m 51s) [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:39] hallo [19:10:56] (03PS1) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 [19:11:01] I need to update my ssh keys for accessing stat1005 and mwmaint1001 [19:11:10] can anybody please remind me how I do that? 
[19:12:21] aharoni: iirc create a ticket with the appropriate new ssh key and the person on ops clinic duty will see it, verify you are you, and update puppet [19:12:47] not sure which queue, ops-access-requests doesn't seem quite right [19:13:50] (03CR) 10jerkins-bot: [V: 04-1] [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [19:14:43] tagging operations and sre-access-requests will do the trick [19:15:09] (03PS1) 10Ottomata: EventStreams now supports multi DC, but still run from main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) [19:15:42] (03Abandoned) 10Andrew Bogott: rough draft of etcd for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/449192 (owner: 10Andrew Bogott) [19:17:32] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) [19:21:52] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10RobH) [19:22:55] (03CR) 10Ppchelko: EventStreams now supports multi DC, but still run from main-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata) [19:26:56] ebernhardson: I'm dumb about security :) [19:27:24] Pchelolo: yeah, but i'd expect more annoyance with switching from maintenance on MirrorMaker than from DC switchover [19:27:25] so I have to check: id_rsa.pub is the one that I can share publicly, right? [19:27:30] oops wrong channel... 
[19:27:58] (03PS1) 10Andrew Bogott: Rename labnodepool1002 to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/451084 (https://phabricator.wikimedia.org/T201439) [19:28:21] (03PS2) 10Andrew Bogott: makedomain: add --delete and --all functions [puppet] - 10https://gerrit.wikimedia.org/r/450875 [19:30:43] !log shutting down labnodepool1002 in advance of a rename. T201439 [19:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:48] T201439: rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 [19:30:58] (03CR) 10Andrew Bogott: [C: 032] makedomain: add --delete and --all functions [puppet] - 10https://gerrit.wikimedia.org/r/450875 (owner: 10Andrew Bogott) [19:31:15] (03PS2) 10Andrew Bogott: Rename labnodepool1002 to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/451084 (https://phabricator.wikimedia.org/T201439) [19:32:25] (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451087 [19:32:27] (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451087 (owner: 1020after4) [19:33:37] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Dzahn) [19:33:39] 10Operations, 10Patch-For-Review: Netbox: setup backups - https://phabricator.wikimedia.org/T190184 (10Dzahn) 05Open>03Resolved I also tested actual restore of the database: Dropped the live prod database, confirmed web UI was down, then restored DB from dumpfile from backups and the application was up ag... [19:33:42] bblack bstorm_ Reedy ^ [19:34:20] ? 
[19:34:20] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Dzahn) [19:34:26] (03CR) 10Andrew Bogott: [C: 032] Rename labnodepool1002 to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/451084 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [19:34:27] You can create the gerrit patch yourself ;P [19:35:55] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) [19:36:30] (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451087 (owner: 1020after4) [19:36:52] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.32.0-wmf.16 refs T191062 [19:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:57] T191062: 1.32.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T191062 [19:37:05] Reedy: I need to update my ssh keys for accessing stat1005 and mwmaint1001. Should I create a Phab task for that and paste the .pub keys there? [19:37:11] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Dzahn) Subtask to setup backups is now resolved. Incl. testing restore of files from Bacula console back to both netmon servers and dropping the psql database for netb... [19:37:21] aharoni: Yes, but you can also make the gerrit patch yourself if you want [19:37:59] Reedy: I reinstalled my laptop, and haven't fully configured Gerrit yet. [19:38:06] Web interface editing? :) [19:38:09] 10Operations: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10Amire80) [19:38:27] Reedy: looks sensible? 
^ [19:38:47] Why two keys? [19:39:27] Reedy: If I recall correctly, different keys are needed for production and stat servers... Am I remembering incorrectly? [19:39:31] You should be explicit if you want them added as additional keys, or to replace the existing ones [19:39:39] You need different ones for cloud and prod [19:39:45] AFAIK stat shouldn't need a separate one [19:39:50] 10Operations: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10Amire80) [19:40:00] OK, and do they have to be different from Gerrit? [19:40:12] Yes, gerrit is the same as cloud [19:40:57] Reedy: OK, I removed one [19:40:59] 10Operations: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10Amire80) [19:44:24] (03PS1) 10Jforrester: Beta Cluster: Enable CSP in report-only mode on all BC wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451089 [19:45:25] !log upload prometheus-logstash-exporter_0.1.2-1 to apt.wikimedia.org/stretch-wikimedia/main [19:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:29] !log upload prometheus-logstash-exporter_0.1.2-1 to apt.wikimedia.org/jessie-wikimedia/main [19:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:30] (03CR) 10Jforrester: "So we can sanity-check in a production-like environment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/451089 (owner: 10Jforrester) [19:47:16] (03PS1) 10Andrew Bogott: rename labnodepool1002.mgmt to cloudservices1003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/451090 [19:47:18] (03PS1) 10Andrew Bogott: Move labnodepool1002.eqiad.wmnet to cloudservices1003.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/451091 [19:47:49] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) `prometheus-logstash-exporter_0.1.2-1` has been uploaded to `apt.wikimedia.org/jessie-wikimedia/main` and `apt.wikimedia.... [19:48:23] (03PS2) 10Andrew Bogott: rename labnodepool1002.mgmt to cloudservices1003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/451090 (https://phabricator.wikimedia.org/T201439) [19:48:25] (03PS2) 10Andrew Bogott: Move labnodepool1002.eqiad.wmnet to cloudservices1003.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/451091 (https://phabricator.wikimedia.org/T201439) [19:48:47] 10Operations, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Krinkle) I haven't thought of using the screen command directly over ssh. That's neat. My setup is to have the `screen -DR` command in `~/.bash_profile` remotely. With that in place, any new login from... 
[19:51:06] (03CR) 10RobH: [C: 031] Move labnodepool1002.eqiad.wmnet to cloudservices1003.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/451091 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [19:51:23] (03CR) 10RobH: [C: 031] rename labnodepool1002.mgmt to cloudservices1003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/451090 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [19:51:57] (03CR) 10Andrew Bogott: [C: 032] rename labnodepool1002.mgmt to cloudservices1003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/451090 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [19:52:04] (03CR) 10Andrew Bogott: [C: 032] Move labnodepool1002.eqiad.wmnet to cloudservices1003.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/451091 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [19:52:06] (03PS2) 10Ottomata: EventStreams now supports multi DC, but should run active/passive [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) [19:55:00] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) [19:55:04] (03CR) 10Ottomata: "Volans does this make sense?" [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata) [19:58:36] (03CR) 10BryanDavis: "> I skipped the comment on purpose, since a phabricator reference is" [puppet] - 10https://gerrit.wikimedia.org/r/450610 (https://phabricator.wikimedia.org/T197176) (owner: 10BryanDavis) [19:58:43] (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451087 (owner: 1020after4) [20:02:54] (03CR) 10BryanDavis: "> Shall I merge + build this?" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: 10BryanDavis) [20:05:27] (03CR) 10Gehel: "A few comments inline already. I'll probably add a few more tomorrow." (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:06:40] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004 (10Cmjohnson) [20:06:48] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004 (10Cmjohnson) 05Open>03Resolved [20:06:51] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (10Cmjohnson) [20:12:08] (03PS1) 10Andrew Bogott: cloudservices1003: try a different partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/451100 (https://phabricator.wikimedia.org/T201439) [20:12:32] 10Operations: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10Amire80) [20:12:32] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host terbium [dns] - 10https://gerrit.wikimedia.org/r/451101 (https://phabricator.wikimedia.org/T200763) [20:12:58] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host terbium [dns] - 10https://gerrit.wikimedia.org/r/451101 (https://phabricator.wikimedia.org/T200763) (owner: 10Cmjohnson) [20:13:05] (03CR) 10Andrew Bogott: [C: 032] cloudservices1003: try a different partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/451100 (https://phabricator.wikimedia.org/T201439) (owner: 10Andrew Bogott) [20:13:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decom/reclaim terbium - https://phabricator.wikimedia.org/T200763 (10Cmjohnson) [20:13:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 
10Patch-For-Review: Decom/reclaim terbium - https://phabricator.wikimedia.org/T200763 (10Cmjohnson) 05Open>03Resolved [20:15:10] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host snapshot1001 [dns] - 10https://gerrit.wikimedia.org/r/451102 (https://phabricator.wikimedia.org/T197021) [20:15:40] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host snapshot1001 [dns] - 10https://gerrit.wikimedia.org/r/451102 (https://phabricator.wikimedia.org/T197021) (owner: 10Cmjohnson) [20:17:02] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10User-ArielGlenn: decommission snapshot1001 - https://phabricator.wikimedia.org/T197021 (10Cmjohnson) [20:17:09] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10User-ArielGlenn: decommission snapshot1001 - https://phabricator.wikimedia.org/T197021 (10Cmjohnson) 05Open>03Resolved [20:23:22] (03PS1) 10Amire80: Replace ssh keys for amire80 [puppet] - 10https://gerrit.wikimedia.org/r/451105 (https://phabricator.wikimedia.org/T201454) [20:24:52] (03PS1) 10Cmjohnson: Removing mgmt dns for decom hosts ocg1001-3 [dns] - 10https://gerrit.wikimedia.org/r/451106 (https://phabricator.wikimedia.org/T177958) [20:25:45] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom hosts ocg1001-3 [dns] - 10https://gerrit.wikimedia.org/r/451106 (https://phabricator.wikimedia.org/T177958) (owner: 10Cmjohnson) [20:27:01] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958 (10Cmjohnson) [20:27:14] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958 (10Cmjohnson) 05Open>03Resolved [20:27:16] 10Operations, 10OCG-General, 10Patch-For-Review, 10Services (watching): Decommission OCG from production - https://phabricator.wikimedia.org/T177931 (10Cmjohnson) [20:28:23] (03PS1) 10Krinkle: webperf: Switch arclamp_host in Beta from mwlog host to webperf13 [puppet] - 
10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) [20:29:58] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host poolcounter1002 [dns] - 10https://gerrit.wikimedia.org/r/451108 (https://phabricator.wikimedia.org/T193025) [20:30:29] (03CR) 10Brian Wolff: "The proposed setting for production is not true, but setting useNonce => false (or whatever the name is)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451089 (owner: 10Jforrester) [20:34:42] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1062.eqiad.wmnet [20:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:21] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1050.eqiad.wmnet [20:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:05] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host stat1002 [dns] - 10https://gerrit.wikimedia.org/r/451178 (https://phabricator.wikimedia.org/T173097) [20:37:30] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host stat1002 [dns] - 10https://gerrit.wikimedia.org/r/451178 (https://phabricator.wikimedia.org/T173097) (owner: 10Cmjohnson) [20:37:54] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host poolcounter1002 [dns] - 10https://gerrit.wikimedia.org/r/451108 (https://phabricator.wikimedia.org/T193025) (owner: 10Cmjohnson) [20:38:35] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 53.14, 28.51, 19.56 [20:39:11] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097 (10Cmjohnson) [20:39:16] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.17, 27.59, 19.37 [20:40:25] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 28.26, 25.59, 19.20 [20:40:35] RECOVERY - High CPU load on API appserver on 
mw1226 is OK: OK - load average: 21.57, 24.88, 19.33 [20:40:36] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 341.78 seconds [20:42:19] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094 (10Cmjohnson) [20:42:27] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097 (10Cmjohnson) 05Open>03Resolved This server was given to Stroz and we have a copy of the hard drive in the eqiad data center on an encrypted drive,... [20:45:06] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:45:13] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.32.0-wmf.16 refs T191062 (duration: 68m 20s) [20:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:18] T191062: 1.32.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T191062 [21:07:11] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1049.eqiad.wmnet [21:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:35] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:11:56] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:11:56] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:12:36] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:13:36] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:13:56] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:14:05] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:14:36] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:14:45] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:14:51] (03PS1) 10Andrew Bogott: Added DNS for dns on cloudservices1003: labs-ns2 and labs-recursor2 [dns] - 10https://gerrit.wikimedia.org/r/451192 [21:16:15] (03CR) 10Andrew Bogott: [C: 032] Added DNS for dns on cloudservices1003: labs-ns2 and labs-recursor2 [dns] - 10https://gerrit.wikimedia.org/r/451192 (owner: 10Andrew Bogott) [21:17:19] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.16 refs T191062 [21:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:24] T191062: 1.32.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T191062 [21:19:01] (03PS1) 10Andrew Bogott: labs-recursor2: fixed typo [dns] - 10https://gerrit.wikimedia.org/r/451193 [21:19:06] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:19:06] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:19:09] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) Created a self dispatch with Dell for a new DIMM. You have successfully submitted request SR977877163. 
[21:19:12] this isn't in the new branch but nonetheless I'm seeing quite a few of "Model contains an error for 29888930: TimeoutError" from ORES [21:19:37] (03CR) 10Andrew Bogott: [C: 032] labs-recursor2: fixed typo [dns] - 10https://gerrit.wikimedia.org/r/451193 (owner: 10Andrew Bogott) [21:19:45] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:19:57] twentyafterfour: In logstash? Thanks, I’ll look into it. [21:20:02] T201412 [21:20:03] T201412: ORES Storage::SqlScoreStorage exception every 2-3 minutes: Model contains an error for [id]: TimeoutError - https://phabricator.wikimedia.org/T201412 [21:20:07] awight: yeah in logstash... [21:20:16] and someone beat me to reporting it ^ [21:20:18] ;) [21:20:20] Great, thanks for making the bug! [21:20:30] ah hehe thanks for finding the prefab bug :p [21:20:36] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:21:03] "Model contains an error for " String formatting not working? [21:21:21] Oh I see! [21:22:05] TimeoutErrors are up a bit in Eqiad. [21:22:07] Nothing scary. [21:22:22] 5xx rate looks good. [21:22:54] (03PS1) 10Andrew Bogott: Added some placeholder hiera settings for eqiad1 designate [puppet] - 10https://gerrit.wikimedia.org/r/451194 (https://phabricator.wikimedia.org/T199578) [21:22:56] One of the revids that timed out seems to work now and not timeout. [21:23:45] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:23:45] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:24:05] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:24:06] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:24:06] (03CR) 10Andrew Bogott: [C: 032] Added some placeholder hiera settings for eqiad1 designate [puppet] - 10https://gerrit.wikimedia.org/r/451194 (https://phabricator.wikimedia.org/T199578) (owner: 10Andrew Bogott) [21:24:39] awight, does it look like those errors are still coming in? [21:25:37] checking... [21:26:11] ores1006 and ores1009 seem to have a pronounced change in timeout errors. [21:26:22] While other nodes in eqiad seem more normal. [21:26:53] (03PS1) 10Andrew Bogott: eqiad1 designate: fix search-and-replace fail [puppet] - 10https://gerrit.wikimedia.org/r/451196 (https://phabricator.wikimedia.org/T199578) [21:27:18] * awight growls at inability to use wildcards [21:27:36] (03CR) 10Andrew Bogott: [C: 032] eqiad1 designate: fix search-and-replace fail [puppet] - 10https://gerrit.wikimedia.org/r/451196 (https://phabricator.wikimedia.org/T199578) (owner: 10Andrew Bogott) [21:28:53] (03PS1) 10Andrew Bogott: eqiad1 designate: yet more hiera fixes [puppet] - 10https://gerrit.wikimedia.org/r/451197 (https://phabricator.wikimedia.org/T199578) [21:29:08] Is https://grafana-admin.wikimedia.org/ not a thing anymore? 
[21:29:52] (03CR) 10Andrew Bogott: [C: 032] eqiad1 designate: yet more hiera fixes [puppet] - 10https://gerrit.wikimedia.org/r/451197 (https://phabricator.wikimedia.org/T199578) (owner: 10Andrew Bogott) [21:29:56] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:30:00] 10Operations, 10Wikimedia-Mailing-lists: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10JTannerWMF) [21:30:07] Hallo. Can anybody take a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451105/ please? [21:33:11] (03CR) 10Ppchelko: [C: 031] "If this does what we think it does - I'm +1 conceptually." [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata) [21:33:27] (03CR) 10MarcoAurelio: [C: 031] Replace ssh keys for amire80 [puppet] - 10https://gerrit.wikimedia.org/r/451105 (https://phabricator.wikimedia.org/T201454) (owner: 10Amire80) [21:34:19] * awight hits logstash with a bowling ball in purse [21:35:10] halfak: Yeah I think grafana-admin was merged into grafana due to some kind of authentication improvement. [21:36:17] gotcha. [21:36:18] Nice. [21:36:35] I'm looking at timeout errors. It looks like the situation has been rough for almost 24 hours. [21:36:58] halfak: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@dfeae89&_a=h@edd3048 kk sounds like you have a better query [21:37:04] And I was wrong, it affects all of eqiad. [21:37:12] So, this isn’t an ORES timeout, it seems to be a SQL timeout. [21:37:16] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @robh I am circling back to the labvirts and the new controllers did include batteries and they are connected to the cards. They were the exact same battery as the old card. [21:37:20] And the retry behavior is terrible [21:37:25] CODFW seems totally fine.
[21:37:35] PROBLEM - Check systemd state on cp5012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:37:39] Could totally be. We'll timeout if the API takes too long. [21:39:44] Oops, I’m wrong: this is an ORES timeout, but not interpreted until we try to parse during storing to SQL. Gross. [21:46:22] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) paste of the lspci output: P7435 This shows: af:00.0 Ethernet controller: QLogic Corp. Device 8070 (rev 02) af:00.1 Ethernet controller: QLogic... [21:47:26] awight, I think we're hitting an API slowdown. [21:47:35] And it's only noticeable in eqiad. [21:47:40] Any good way to check that? [21:48:27] 10Operations: Deactivate Chad's Racktables account - https://phabricator.wikimedia.org/T196787 (10RobH) 05Open>03Resolved done [21:48:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10Dzahn) [21:49:53] halfak: Hopefully we still have API timings for changeprop [21:50:21] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) I created a ticket with HP....this should be fun Case ID: 5331584481 [21:51:01] halfak: Lots of MW API timeout errors, it seems [21:51:28] Yeah. I think we're just downstream of this. [21:51:37] Our CPU/Memory usage seems stable and nominal. [21:52:55] Changeprop is steady over the past month fwiw, so we’re probably not the cause [21:53:27] Check this out: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=11&fullscreen&orgId=1&from=now-7d&to=now-1m [21:54:02] Did we do a deployment yesterday? [21:54:02] Timeout errors are high since Aug 6, 07h15 roughly [21:54:04] nope [21:54:25] Yeah. I think we're just downstream. Who is the right person to ping?
[21:54:29] Remind me, this graph is of MediaWiki timeouts that we receive, right? [21:54:59] fyi https://grafana.wikimedia.org/dashboard/db/api-backend-summary?refresh=5m&orgId=1&from=now-7d&to=now [21:55:48] Our time correlates to a step change in those graphs [21:55:51] (03PS1) 10RobH: setup new rdb10(09|10).eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/451201 (https://phabricator.wikimedia.org/T196685) [21:56:47] (03CR) 10RobH: [C: 032] setup new rdb10(09|10).eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/451201 (https://phabricator.wikimedia.org/T196685) (owner: 10RobH) [21:57:18] I don’t see anything obvious in the SAL, and don’t actually know who to corner about this. [21:58:21] so it's an api you're calling that is the source of the timeouts? [21:59:47] twentyafterfour: We think so, and you can see some kind of load increasing since yesterday c. 07h15 UTC [22:03:28] halfak: I double-checked our timeout metric and it’s a bit trickier than what I said earlier. It’s actually just a general timeout around score processing, so most likely caused by MW latency but we can’t say for sure. [22:03:29] awight, our TimeoutError can be generated internally by a long CPU process or a long IO process. [22:03:30] (03CR) 10Gehel: Add cookbook entry point script (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [22:03:37] Our only long IO processes are MW API calls [22:03:45] And a long CPU process would show up in our metrics. [22:04:05] So I'm left with only the MW API calls when interpreting that graph -- but it is not a direct mapping.
[22:05:14] Thanks, that makes sense [22:05:31] PROBLEM - Recursive DNS on 208.80.154.143 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:06:03] ^ that IP is labs-recursor2 [22:09:01] PROBLEM - Auth DNS on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:10:42] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:10:46] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10thcipriani) [22:12:22] PROBLEM - Check for gridmaster host resolution UDP on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:13:50] (03PS1) 10RobH: fixing typo in rdb1010 entry [puppet] - 10https://gerrit.wikimedia.org/r/451203 (https://phabricator.wikimedia.org/T196685) [22:14:18] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1048.eqiad.wmnet [22:14:20] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10Dzahn) Note that we already have: %releasers-mediawiki ALL = (jenkins) NOPASSWD: ALL %releasers-mediawiki ALL = NOPASSWD: /usr/sbin/service jenkins * Which c... [22:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:34] (03CR) 10RobH: [C: 032] fixing typo in rdb1010 entry [puppet] - 10https://gerrit.wikimedia.org/r/451203 (https://phabricator.wikimedia.org/T196685) (owner: 10RobH) [22:16:15] (03PS1) 10Dzahn: design.wm.org: add apache redirect for style-guide/wiki/ [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) [22:26:44] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10thcipriani) >>! 
In T201470#4486823, @Dzahn wrote: > Which could run any command as jenkins. I understand you need more than the jenkins user though? For upgra... [22:32:45] 10Operations, 10Wikimedia-Mailing-lists: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10JTannerWMF) [22:34:55] (03CR) 10VolkerE: [C: 031] design.wm.org: add apache redirect for style-guide/wiki/ [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) (owner: 10Dzahn) [22:45:31] (03PS1) 10RobH: rdb1009 mac correction [puppet] - 10https://gerrit.wikimedia.org/r/451205 [22:50:09] (03PS1) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork for php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/451206 [22:50:39] (03PS1) 10Andrew Bogott: eqiad1 pdns: specify an ip for the pdns database [puppet] - 10https://gerrit.wikimedia.org/r/451207 (https://phabricator.wikimedia.org/T199578) [22:50:55] (03CR) 10jerkins-bot: [V: 04-1] httpd: fix mpm_event module conflict with mpm_prefork for php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [22:51:33] (03CR) 10Andrew Bogott: [C: 032] eqiad1 pdns: specify an ip for the pdns database [puppet] - 10https://gerrit.wikimedia.org/r/451207 (https://phabricator.wikimedia.org/T199578) (owner: 10Andrew Bogott) [22:53:37] (03CR) 10RobH: [C: 032] rdb1009 mac correction [puppet] - 10https://gerrit.wikimedia.org/r/451205 (owner: 10RobH) [22:53:45] (03PS2) 10RobH: rdb1009 mac correction [puppet] - 10https://gerrit.wikimedia.org/r/451205 [22:56:49] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) [22:56:58] (03PS2) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 [22:57:31] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rename/reimage 
labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) a:05Andrew>03Cmjohnson This server is up and puppetized, with one puppet error which is T20... [22:57:50] (03CR) 10jerkins-bot: [V: 04-1] httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180807T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:17] (03PS3) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 [23:01:01] (03CR) 10Paladox: [C: 031] httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [23:01:04] (03CR) 10jerkins-bot: [V: 04-1] httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [23:02:29] (03PS1) 10Andrew Bogott: cloudservices1003: added ipv6 addresses [dns] - 10https://gerrit.wikimedia.org/r/451208 [23:03:26] (03CR) 10Andrew Bogott: [C: 032] cloudservices1003: added ipv6 addresses [dns] - 10https://gerrit.wikimedia.org/r/451208 (owner: 10Andrew Bogott) [23:04:22] RECOVERY - Recursive DNS on 208.80.154.143 is OK: DNS OK: 0.189 seconds response time. www.wikipedia.org returns 208.80.154.224 [23:06:17] (03CR) 10Krinkle: "See also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449532/ which might be related." [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [23:07:53] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
andrew bogott I believe these will recover when DNS refreshes. [23:07:53] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott I believe these will recover when DNS refreshes. [23:16:22] (03Abandoned) 10Krinkle: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [23:16:34] (03Abandoned) 10Krinkle: Consolidate all of the simple wikimedia.org VHosts into two [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [23:16:38] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10RobH) This is getting a failure on the preseed when attempting installation. I tried to pull the installer logs, but they won't mount via t... [23:17:28] (03PS4) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 [23:18:15] (03CR) 10jerkins-bot: [V: 04-1] httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (owner: 10Dzahn) [23:19:23] (03PS5) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) [23:28:01] !log restarted populateContentTables.php for commonswiki, it died at rev_id 60164000 [23:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:03] ACKNOWLEDGEMENT - HTTPS-policy on policy.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate policy.wikimedia.org valid until 2018-09-05 23:59:59 +0000 (expires in 29 days) daniel_zahn https://phabricator.wikimedia.org/T172210 [23:44:21] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [23:52:06] (03PS1) 10RobH: fixing netboot.cfg regex 
[puppet] - 10https://gerrit.wikimedia.org/r/451215 [23:55:07] (03CR) 10RobH: [C: 032] fixing netboot.cfg regex [puppet] - 10https://gerrit.wikimedia.org/r/451215 (owner: 10RobH)