[00:30:59] thedj: Hm.. which resources are you referring to at https://phabricator.wikimedia.org/T201772#4497359 ?
[00:33:15] We currently have Varnish maxage at 5-7 days for all cacheable MW resources, with client-side age 0 for html, client-side age 30 days for versioned JS/CSS urls, and 5min for unversioned JS/CSS urls (which is mostly the skin stylesheet and startup manifest only).
[01:45:10] !log on mwmaint1001 running populateContentTables.php on all wikis using ~tstarling/pct-list T183488
[01:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:18] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[02:36:13] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 14m 21s)
[02:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:41] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Aug 13 02:46:41 UTC 2018 (duration 10m 28s)
[02:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 815.52 seconds
[03:45:30] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 218.38 seconds
[04:42:50] PROBLEM - HHVM rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:43:41] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 80261 bytes in 0.093 second response time
[06:21:42] (03PS12) 10Giuseppe Lavagetto: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[06:24:11] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[06:24:55] (03PS10) 10Giuseppe Lavagetto: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[06:26:31] PROBLEM - DPKG on restbase2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:26:51] PROBLEM - SSH on restbase2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:27:06] <_joe_> ouch
[06:27:10] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:11] PROBLEM - Restbase root url on restbase2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:31] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:28:01] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:28:13] <_joe_> I suspect network issues tbh
[06:28:40] PROBLEM - cassandra-b service on restbase2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
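An aside on the cache lifetimes described at [00:33:15]: the three client-side ages (0 for HTML, 30 days for versioned JS/CSS URLs, 5 minutes for the unversioned skin stylesheet and startup manifest) sit on top of a 5-7 day Varnish TTL. The sketch below only restates those numbers as Cache-Control values; the header strings and class names are illustrative assumptions, not the production configuration.

```python
# Illustrative sketch of the cache lifetimes described at [00:33:15].
# The header strings and class names are assumptions, not the live config;
# only the three client-side ages and the Varnish TTL come from the log above.
DAY = 24 * 3600

CLIENT_CACHE_CONTROL = {
    # HTML: served with client-side age 0, revalidated on every request
    "html": "private, max-age=0, must-revalidate",
    # Versioned JS/CSS URLs: safe to cache client-side for 30 days
    "versioned_asset": f"public, max-age={30 * DAY}",
    # Unversioned JS/CSS (skin stylesheet, startup manifest): 5 minutes
    "unversioned_asset": "public, max-age=300",
}

# Varnish keeps its own copy for 5-7 days regardless of the client-side age.
VARNISH_TTL_RANGE = (5 * DAY, 7 * DAY)


def cache_control_for(resource_class: str) -> str:
    """Return the assumed Cache-Control value for a resource class."""
    return CLIENT_CACHE_CONTROL[resource_class]
```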
[06:28:50] PROBLEM - cassandra-a service on restbase2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:28:52] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[06:28:53] <_joe_> we had a ton of issues in eqiad that just resolved before alarming
[06:29:21] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]
[06:29:30] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:29:40] PROBLEM - Check size of conntrack table on restbase2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:29:41] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/bin/swift-drive-audit]
[06:30:01] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042
[06:30:22] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.136 port 9042
[06:30:50] PROBLEM - cassandra-a service on restbase2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:30:51] PROBLEM - configured eth on restbase2003 is CRITICAL: NRPE: Unable to read output
[06:31:00] PROBLEM - Disk space on restbase2003 is CRITICAL: NRPE: Unable to read output
[06:31:00] PROBLEM - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[06:31:01] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[06:31:01] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[06:32:41] PROBLEM - Host restbase2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:34:23] <_joe_> !log powercycling restbase2003, in kernel panic
[06:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:06] !log oblivian@neodymium conftool action : set/pooled=inactive; selector: name=restbase2003.codfw.wmnet
[06:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:07] !log finish up cache reboots for kernel updates: cp5001.eqsin.wmnet,cp3035.esams.wmnet,cp[4021-4022].ulsfo.wmnet
[06:53:10] RECOVERY - Host restbase2003 is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms
[06:53:11] RECOVERY - Check size of conntrack table on restbase2003 is OK: OK: nf_conntrack is 0 % full
[06:53:11] RECOVERY - DPKG on restbase2003 is OK: All packages OK
[06:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:30] 10Operations, 10ops-codfw: restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 (10Joe)
[06:53:31] RECOVERY - Disk space on restbase2003 is OK: DISK OK
[06:53:31] RECOVERY - SSH on restbase2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[06:53:31] RECOVERY - configured eth on restbase2003 is OK: OK - interfaces up
[06:53:50] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 16052 bytes in 0.144 second response time
[06:55:20] RECOVERY - cassandra-b service on restbase2003 is OK: OK - cassandra-b is active
[06:55:30] RECOVERY - cassandra-a service on restbase2003 is OK: OK - cassandra-a is active
[06:55:50] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused
[06:55:52] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused
[06:56:01] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[06:56:11] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.136 and port 9042: Connection refused
[06:56:12] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[06:56:12] PROBLEM - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[06:57:02] <_joe_> heh
[06:57:06] <_joe_> the disk is broken
[06:57:20] 10Operations, 10ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T201761 (10jcrespo) a:03Papaul This is a different this than last time, please replace this 600 GB disk with any spare you have, thanks.
[06:58:01] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:30] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:31] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:50] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:02:17] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10jcrespo) 05Open>03declined The host will be decommission as soon as a new codfw x1 host is purchased: T184888
[07:03:48] 10Operations, 10ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T201761 (10jcrespo) If for some reason there wouldn't be spares, we could use some disk from db2033, but we prefer a new one as this is a master.
[07:05:32] (03PS1) 10Muehlenhoff: Add varnent to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/452305
[07:06:26] (03CR) 10Muehlenhoff: [C: 032] Add varnent to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/452305 (owner: 10Muehlenhoff)
[07:06:30] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2020-06-24 13:01:33 +0000 (expires in 681 days)
[07:06:40] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2020-06-24 13:01:32 +0000 (expires in 681 days)
[07:07:11] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042
[07:07:20] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042
[07:07:41] RECOVERY - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-c valid until 2020-06-24 13:01:33 +0000 (expires in 681 days)
[07:08:41] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.136 port 9042
[07:12:53] (03CR) 10Volans: [C: 04-2] "I started to add the menu capabilities and found myself in refactoring quite a bit of this. At this point I think is better to wait a bit " [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[07:16:09] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.16/maintenance/dumpTextPass.php: Add the 'full' option explicitly to dumpTextPass.php (T201803) (duration: 00m 58s)
[07:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:16] T201803: wikidata dumps broken with 'unexpected option: --full!' for revision history content dumps - https://phabricator.wikimedia.org/T201803
[08:15:46] 10Operations, 10Traffic, 10Patch-For-Review: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865 (10ema) 05Open>03Resolved a:03ema
[08:16:08] (03CR) 10Ema: [C: 031] Enable microcode for LVS load balancers [puppet] - 10https://gerrit.wikimedia.org/r/451858 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff)
[08:16:49] (03PS2) 10Muehlenhoff: Enable microcode for LVS load balancers [puppet] - 10https://gerrit.wikimedia.org/r/451858 (https://phabricator.wikimedia.org/T127825)
[08:17:28] (03PS3) 10Ema: Varnish: Unset X-Request-Id for external requests [puppet] - 10https://gerrit.wikimedia.org/r/451240 (https://phabricator.wikimedia.org/T201409) (owner: 10Mobrovac)
[08:17:30] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:17:44] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for LVS load balancers [puppet] - 10https://gerrit.wikimedia.org/r/451858 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff)
[08:18:58] mmh the alert "MediaWiki memcached error rate" fired a few times in the last days, how to react to it?
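On the question above about how to react to the "MediaWiki memcached error rate" alert: the alert text gives a critical threshold of 5000 and links a Grafana dashboard backed by Graphite, so a first step is simply to pull the recent datapoints and see how far above that threshold they sit. The sketch below uses the standard Graphite render API; the Graphite hostname and the metric path are placeholders, not the real ones (those live in the linked dashboard definition).

```python
# Minimal sketch: pull the last 30 minutes of the memcached error metric from
# the Graphite render API and count datapoints above the critical threshold
# (5000, per the alert text above). Host and metric path are placeholders.
import requests

GRAPHITE = "https://graphite.example.org"    # assumption, not the real host
METRIC = "MediaWiki.memcached.errors.rate"   # placeholder metric path
CRITICAL = 5000.0


def datapoints_above_threshold(minutes: int = 30) -> int:
    resp = requests.get(
        f"{GRAPHITE}/render",
        params={"target": METRIC, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    count = 0
    for series in resp.json():
        for value, _timestamp in series["datapoints"]:
            if value is not None and value > CRITICAL:
                count += 1
    return count


if __name__ == "__main__":
    print(f"datapoints above {CRITICAL}: {datapoints_above_threshold()}")
```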
[08:19:28] (03PS4) 10Ema: Varnish: Unset X-Request-Id for external requests [puppet] - 10https://gerrit.wikimedia.org/r/451240 (https://phabricator.wikimedia.org/T201409) (owner: 10Mobrovac)
[08:19:40] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:20:24] (03CR) 10Ema: [C: 032] Varnish: Unset X-Request-Id for external requests [puppet] - 10https://gerrit.wikimedia.org/r/451240 (https://phabricator.wikimedia.org/T201409) (owner: 10Mobrovac)
[08:26:46] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2001.codfw.wmnet'] ``` The log can...
[08:34:06] (03PS1) 10Jcrespo: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452320
[08:35:26] (03PS1) 10Ema: phabricator/otrs: use cache::text::nodes for mod_remoteip [puppet] - 10https://gerrit.wikimedia.org/r/452321 (https://phabricator.wikimedia.org/T164609)
[08:36:15] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452320 (owner: 10Jcrespo)
[08:37:11] <_joe_> incoming
[08:37:31] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255
[08:37:33] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: add ServerAlias support [puppet] - 10https://gerrit.wikimedia.org/r/451256
[08:37:35] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257
[08:37:37] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968)
[08:37:39] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968)
[08:37:41] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968)
[08:37:43] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968)
[08:37:45] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert simple wikis in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968)
[08:37:47] (03Merged) 10jenkins-bot: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452320 (owner: 10Jcrespo)
[08:38:28] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::vhost: add ServerAlias support [puppet] - 10https://gerrit.wikimedia.org/r/451256 (owner: 10Giuseppe Lavagetto)
[08:39:33] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:s7 and db1101:s8 (duration: 00m 51s)
[08:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:45] !log upgrade and restart db1101
[08:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:54] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler03/12051/mw1261.eqiad.wmnet/ seems ok at first sight" [puppet] - 10https://gerrit.wikimedia.org/r/451255 (owner: 10Giuseppe Lavagetto)
[08:43:41] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324
[08:45:28] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) >>! In T164609#4483549, @Joe wrote: > Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit; knowing the origin of the 5xxs in broad c...
[08:45:38] (03CR) 10jenkins-bot: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452320 (owner: 10Jcrespo)
[08:48:15] (03CR) 10Giuseppe Lavagetto: [C: 031] phabricator/otrs: use cache::text::nodes for mod_remoteip [puppet] - 10https://gerrit.wikimedia.org/r/452321 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema)
[08:49:04] (03CR) 10Ema: [C: 032] phabricator/otrs: use cache::text::nodes for mod_remoteip [puppet] - 10https://gerrit.wikimedia.org/r/452321 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema)
[08:53:29] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2001.codfw.wmnet'] ``` and were **ALL** successful.
[08:58:36] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can...
[09:08:42] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10mark)
[09:12:04] (03PS2) 10Volans: Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079)
[09:12:06] (03PS3) 10Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079)
[09:12:46] (03CR) 10Volans: "replies inline" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[09:17:11] (03CR) 10Jcrespo: [C: 04-2] "Not to put back into production until recompression is finished." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo)
[09:19:30] (03PS7) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080
[09:21:38] (03PS8) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080
[09:22:20] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert simple wikis in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968)
[09:22:22] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - 10https://gerrit.wikimedia.org/r/452325
[09:26:08] wow
[09:26:58] <_joe_> moritzm: oh you're back
[09:27:17] <_joe_> moritzm: i'll need you to review a few apache patches
[09:27:43] <_joe_> :D
[09:28:10] moritzm: don't do it!
[09:29:08] sure, add me to reviewer. looking forward to whatever is lurking in the dark of our apache configs :-)
[09:30:01] moritzm: run!
[09:30:16] <_joe_> moritzm: I already split all vhosts in single files, now I am first expanding the includes, then switching to use mediawiki::web::vhost
[09:30:27] <_joe_> so we can see what's changed via the puppet compiler
[09:33:05] not really, the includes are not expanded
[09:33:14] I'm expanding them manually locally to have a diff
[09:33:25] I'll need another few minutes to review them...
[09:37:53] <_joe_> volans: uh?
[09:38:07] <_joe_> they are expanded indeed in the change I submitted to you
[09:38:48] <_joe_> https://gerrit.wikimedia.org/r/451255 I mean
[09:38:54] (03PS2) 10Gehel: elasticsearch: migrate eqiad cluster to Stretch and RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/450064 (https://phabricator.wikimedia.org/T193649)
[09:39:53] (03CR) 10Gehel: [C: 032] elasticsearch: migrate eqiad cluster to Stretch and RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/450064 (https://phabricator.wikimedia.org/T193649) (owner: 10Gehel)
[09:40:00] _joe_: exactly, and the compiler expand them for the change and not for the prod one, so to check the diff I need to expand them manually
[09:40:51] <_joe_> volans: uh, what do you mean?
[09:41:00] <_joe_> https://puppet-compiler.wmflabs.org/compiler03/12051/mw1261.eqiad.wmnet/ I don't see any non-expanded include here
[09:41:07] see wikimaniateam.wikimedia.org.conf in https://puppet-compiler.wmflabs.org/compiler03/12051/mw1261.eqiad.wmnet/
[09:41:42] <_joe_> oh wikimaniateam, yeah sorry
[09:41:48] <_joe_> I clearly overlooked that
[09:41:54] <_joe_> let me do the patch and merge it
[09:42:04] <_joe_> we can rebase that one on top of it
[09:42:14] ok
[09:45:18] (03PS1) 10Jon Harald Søby: Enable Translate extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814)
[09:46:54] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM, but the syntax looks complex and the generated name 'block_sync-tools_tools-project-backup_tools-project' is a bit confusing." [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm)
[09:49:34] (03PS1) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968)
[09:49:46] <_joe_> volans: review this first :P ^^
[09:49:50] ack
[09:53:55] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can...
[09:57:06] _joe_: add the bug number to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451255/ ?
[09:59:11] <_joe_> ema: yeah sorry, it's https://phabricator.wikimedia.org/T196968
[10:00:56] 10Operations: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Joe) p:05Triage>03Normal a:03Joe
[10:01:54] (03CR) 10Vgutierrez: [C: 032] lvs2007-lvs2010 production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/451607 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez)
[10:01:59] (03PS2) 10Vgutierrez: lvs2007-lvs2010 production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/451607 (https://phabricator.wikimedia.org/T196560)
[10:05:50] _joe_: how the intermediate CR should help me with the diff? I still need to add the includes manually to check it :-P
[10:06:04] at least is easier to see has no diff
[10:06:23] <_joe_> well the inclusion of the includes is done with a script
[10:06:35] <_joe_> so it should easy enough to check
[10:07:41] (03CR) 10Ema: [C: 031] "One nit, LGTM otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[10:11:42] wait!
[10:12:34] volans: ?
[10:12:46] upload.wikimedia.
[10:13:23] (03CR) 10Volans: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[10:14:03] volans: ah, good catch!
[10:16:57] (03PS4) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720)
[10:20:38] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` and were **ALL** successful.
[10:23:30] (03PS3) 10MarcoAurelio: Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961)
[10:27:55] <_joe_> volans: the solution is something you will love
[10:28:14] * volans preparing to run
[10:29:30] !log upgrade and restart db2036
[10:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:54] (03PS2) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968)
[10:35:08] <_joe_> volans: ^^
[10:35:42] _joe_: domain_suffix should be 'org'
[10:35:54] _joe_: sorry to disturb, but does that have anything to do with not creating new wikis each year?
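_joe_ notes above that the include expansion "is done with a script" so the rendered vhosts can be compared in the puppet compiler output. The helper below is not that script, only a hypothetical sketch of the idea: recursively inline Apache `Include` directives so an expanded config can be diffed line by line against a generated one (wildcard includes and ServerRoot-relative path handling are ignored for brevity).

```python
# Hypothetical stand-in for the kind of helper mentioned above: inline Apache
# "Include" directives so a generated vhost can be diffed against the expanded
# one. This is NOT the actual script used in puppet.git, just a sketch.
import re
from pathlib import Path

INCLUDE_RE = re.compile(r"^\s*Include\s+(\S+)\s*$", re.IGNORECASE)


def expand_includes(conf_path: Path, root: Path) -> str:
    """Return the config text with every Include directive replaced inline."""
    lines = []
    for line in conf_path.read_text().splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Recurse so nested includes are expanded too.
            lines.append(expand_includes(root / match.group(1), root))
        else:
            lines.append(line)
    return "\n".join(lines)

# Typical use would be to expand both the old and the new rendering and then
# diff the two resulting strings, which is essentially what is being done by
# hand in the conversation above.
```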
[10:36:41] <_joe_> volans: yeah meh just realized [10:36:48] <_joe_> jynus: nope [10:36:53] oh [10:37:00] ok [10:37:20] <_joe_> jynus: this is all in preparation for the php7 migration [10:37:25] thanks [10:37:29] <_joe_> I know it might seem I took the scenic route [10:37:49] <_joe_> but it was the only sane one given the chaos of our apache configs [10:37:54] <_joe_> it was long overdue, even [10:40:32] (03PS3) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) [10:50:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:51:10] (03CR) 10Jon Harald Søby: "The deployer should also run the following script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [10:52:03] <_joe_> uhm [10:52:12] <_joe_> can someone look at those memcached errors? [10:52:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:53:03] <_joe_> it seems related to the mcrouter switch tbh [10:53:14] <_joe_> we went from 0 per min to ~ 10 per min constantly [10:54:34] <_joe_> it looks like mw1313 [10:55:53] <_joe_> or well, it seems mw1313 saw mc1035 flapping [10:56:32] <_joe_> same on mw1235 [10:57:09] jan_drewniak: are you doing portals update in a few minutes? looks like the swat and portals window overlap, will talk to my team about that today [10:58:23] oh wow so all this time Common was still not on HHVM? [10:58:25] Commons* [10:58:30] <_joe_> Krenair: no it was [10:58:40] <_joe_> because we have a catchall that sends all .php to hhvm anyways [10:58:41] @zeljkof: yeah, I'll do the portal swat now, but if you could look into changing the schedule that'd be great. [10:58:45] ah [10:58:51] <_joe_> but yeah, subtle bugs [10:58:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:59:13] !log rebooting kraz/irc.wikimedia.org for kernel security update [10:59:22] jan_drewniak: ok, let me know when you are done so I can start swat, I'll check what can be done with the windows [10:59:56] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452343 (https://phabricator.wikimedia.org/T128546) [11:00:03] moritzm, space at the beginning broke it :) [11:00:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T1100). [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T1100). [11:00:05] Jhs: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:27] I'm here! :) [11:00:40] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452343 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:00:44] I can swat today [11:00:58] (03CR) 10Ladsgroup: [C: 031] "ORES service now has it last time I checked :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope) [11:01:06] Jhs: we'll wait for jan_drewniak to finish portals update, but as far as I remember that does not take long [11:01:11] sure (Y) [11:01:55] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452343 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:01:58] !log rebooting kraz/irc.wikimedia.org for kernel security update [11:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:07] Krenair: doh, good catch. thanks! [11:02:53] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452343 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:03:01] zeljkof, in the meantime, i can fill you in on what might be weirdness: the second patch of mine was actually deployed and reverted. The reason was some missing PNG files, which are in the first patch. I'm not sure if the patch can be deployed again (probably not possible?), or if you can simply(?) revert the revertion [11:03:40] (I'm probably mixing up merge and deploy again, sorry) [11:04:44] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:452343|Bumping portals to master (T128546)]] (duration: 00m 51s) [11:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:05:02] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:05:35] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:452343|Bumping portals to master (T128546)]] (duration: 00m 50s) [11:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:55] zeljkof: okydoke, done! [11:07:14] jan_drewniak: thanks, starting with swat then! [11:07:23] Jhs: I'll take a look [11:11:16] Jhs: ok, finally looking at patches, so there are three patches for today [11:11:36] zeljkof, yeah! the first two are connected, the third is separate [11:12:07] (03CR) 10Volans: [C: 04-1] "I still don't see it applied correctly, see:" [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:12:10] the second one, 451792 is already merged, that means it's also probably deployed [11:12:37] (merge does not deploy automatically, there's a separate step) [11:13:05] Jhs: ah, there is a revert for 451792 [11:13:23] yes, exactly. so is it possible to revert the revert? :) [11:13:36] Jhs: yes, so I can revert 451810 [11:14:29] Jhs: ok, so first step, merge and deploy 451813 [11:15:00] zeljkof, aye [11:15:18] Jhs: related task is T200152? 
it's not in the commit message, I can add it, just checking [11:15:18] T200152: Use the correct Pashto Wikivoyage wordmark on mobile site - https://phabricator.wikimedia.org/T200152 [11:15:34] (03CR) 10Volans: "I've done a quick first pass, skipping most of the Reporter's class methods." (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [11:16:04] zeljkof, sort of, but not 100 %. But you can add if you want [11:16:48] Jhs: 451813 is a step in resolving T200152, right? [11:16:54] zeljkof, yes [11:17:36] (03PS2) 10Zfilipin: Add missing PNG files in mobile logo folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:18:19] Jhs: also, the first line of the commit message should say what the patch is doing, " Add missing PNG files in mobile logo folder" is too generic [11:18:49] zeljkof, right, sorry [11:18:54] Jhs: "Add missing Pashto Wikivoyage PNG files..." would be better [11:19:08] i can amend now if you like? [11:19:14] Jhs: please do [11:19:37] Jhs: you can edit the commit message from gerrit [11:20:03] (03PS8) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [11:20:05] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: use keytone_host instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) [11:20:14] (03CR) 10Zfilipin: "PS2 adds Phabricator task to the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:21:08] (03CR) 10MarcoAurelio: Enable Translate extension on oldwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:21:48] zeljkof, done [11:22:59] Jhs: um, I don't see the change? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/451813 [11:23:33] (03PS3) 10Jon Harald Søby: Add missing PNG files in mobile logo folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) [11:23:37] zeljkof, ah, forgot to click Publish edit [11:23:38] sorry [11:23:52] Jhs: :D no problem, happens to me all the time [11:24:07] i was seeing it on my screen, so thought it was done :P [11:24:31] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:25:41] Hauskatze, hiya [11:25:52] (03Merged) 10jenkins-bot: Add missing PNG files in mobile logo folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:26:09] Jhs: ah, it's logos for several languages [11:27:07] Jhs: 451813 is at mwdebug, please test and let me know if I can deploy [11:27:53] zeljkof, looks good to me (Y) [11:28:17] Jhs: ok, deploying [11:28:21] (03PS2) 10Jon Harald Søby: Enable Translate extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) [11:29:20] (03CR) 10Jon Harald Søby: ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:29:36] !log zfilipin@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: [[gerrit:451813|Add missing PNG files in mobile logo folder (T200152)]] (duration: 00m 49s) [11:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:43] T200152: Use the correct Pashto Wikivoyage wordmark on mobile site - https://phabricator.wikimedia.org/T200152 [11:30:51] Jhs: God morgen, Majestet. Kan jeg hjelpe deg? [11:31:57] Hauskatze, :D nothing specific, no [11:32:05] Jhs: deployed and purged, please check [11:32:28] (03Abandoned) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [11:33:06] (03CR) 10Zfilipin: "Purged: T200152#4498060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:33:31] (03CR) 10MarcoAurelio: [C: 031] Enable Translate extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:33:34] zeljkof, looks good [11:33:48] Jhs: ok, so now for 451792, it's reverted by 451810 [11:34:45] zeljkof, correct [11:34:48] Jhs: so please go to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/451810 and click revert (top right), that will create the revert commit, then edit commit message and add the commit to the calendar, instead of 451792 [11:34:51] (03CR) 10jenkins-bot: Add missing PNG files in mobile logo folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:35:39] Jhs: it's important that the commit message has the line "This reverts commit..." 
and references a task in phab [11:35:45] Jhs: let me know if you have any questions [11:36:11] Jhs: I'll deploy 452329 while you create the commit [11:36:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:37:15] zeljkof, ok [11:37:32] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [11:38:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:38:35] (03PS1) 10Jon Harald Søby: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) [11:38:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:39:07] (03CR) 10Gehel: [WIP] extract reporting from BaseEventHandler (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [11:40:08] (03PS2) 10Jon Harald Søby: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) [11:40:15] (03Merged) 10jenkins-bot: Enable Translate extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:40:42] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [11:41:02] zeljkof, done [11:42:04] Jhs: great! 452329 is at mwdebug, please test [11:42:21] !log rebooting dbmonitor* hosts for kernel security update [11:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:31] Jhs: (did not run the script yet, will do after deployment) [11:43:37] zeljkof, getting MediaWiki internal error. probably because of missing DB fields from the script [11:43:53] Jhs: should I deploy? or revert? :D [11:44:14] try deploying, running the script, then revert if still broken? [11:44:23] zeljkof, yeah, possibly [11:44:53] the error says "Original exception: [W3FvHQpAAC4AADwJrF0AAAAD] 2018-08-13 11:44:29: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" " [11:45:05] which seems to be what the script is meant to fix [11:45:17] note that the error appears on all pages, not just Translate-extension ones [11:45:48] Jhs: ok, deploying then, scap will stop me if something is broken for real, I hope :D [11:45:55] good luck :D [11:47:03] Jhs: mwdebug logs say `[{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.16/includes/libs/rdbms/database/Database.php: A database query error has occurred. 
Did you forget to run your application's database schema upd` [11:47:13] so yes, probably just the script needs to run [11:47:32] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:452329|Enable Translate extension on oldwikisource (T201814)]] (duration: 00m 50s) [11:47:32] scap did not complain, it's deploying [11:47:37] and deployed [11:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:38] T201814: Enable translate extension on oldwikisource - https://phabricator.wikimedia.org/T201814 [11:48:10] Table 'sourceswiki.revtag' doesn't exist [11:48:31] jynus: should be fixed now [11:48:44] zeljkof, yay! everything seems to be in order [11:48:48] Jhs: it's deployed, script run done [11:48:51] awesome [11:49:06] (03CR) 10Zfilipin: "Script output T201814#4498090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:49:27] Remember the Alamo maintenance scripts [11:49:28] (03CR) 10Arturo Borrero Gonzalez: "Compiler seems OK:" [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [11:49:39] Jhs: ok, now 452347 [11:49:49] zeljkof, sure [11:50:09] (03CR) 10Arturo Borrero Gonzalez: "I uploaded another patch with your request Andrew." [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [11:50:20] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:50:48] (03CR) 10jenkins-bot: Enable Translate extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452329 (https://phabricator.wikimedia.org/T201814) (owner: 10Jon Harald Søby) [11:51:33] (03CR) 10Zfilipin: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:51:38] (03PS3) 10Zfilipin: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:51:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:52:18] oh I hate gerrit's silent (merge confict) message when +2ing something [11:52:58] (03Merged) 10jenkins-bot: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [11:54:41] Jhs: 452347 is at mwdebug1002 [11:56:51] zeljkof, checking [11:57:24] (03PS2) 10Legoktm: Add php72 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) [11:57:29] zeljkof, works correctly, yes [11:57:36] Jhs: ok, deploying [11:58:20] (03CR) 10Legoktm: [C: 032] Add base stretch image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm) [11:58:33] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:452347|Set mobile wordmark for pswikivoyage (T200152)]] (duration: 00m 52s) [11:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:40] T200152: Use the correct Pashto Wikivoyage wordmark 
on mobile site - https://phabricator.wikimedia.org/T200152 [11:58:42] (03Merged) 10jenkins-bot: Add base stretch image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm) [11:59:04] Jhs: deployed! please check and thanks for deploying with #releng! :D [12:00:00] !log rebooting cloudcontrol* for kernel security update [12:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] zeljkof, yup, all good :D thanks to you too! [12:01:32] PROBLEM - Host cloudcontrol1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:49] sorry, cloudcontrol1003 is me, fixing downtime [12:03:22] RECOVERY - Host cloudcontrol1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [12:04:38] 10Operations, 10ORES, 10Scoring-platform-team (Current): Spin up a new poolcounter node for ores - https://phabricator.wikimedia.org/T201824 (10Ladsgroup) p:05Triage>03High [12:06:33] !log EU SWAT finished [12:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:55] (03CR) 10Legoktm: [C: 032] Add php72 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [12:07:16] (03Merged) 10jenkins-bot: Add php72 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [12:24:39] !log uploaded hhvm-wikidiff2 1.7.2 to apt.wikimedia.org (source package name is still php-wikidiff2 for historical reasons) (T199801) [12:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:45] T199801: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 [12:28:19] !log upgrading mwdebug* servers to wikidiff 1.7.2 [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:25] !log upgrading mwdebug* servers to wikidiff 1.7.2 (T199801) [12:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:46] (03PS1) 10Ema: ATS: move verify_config to ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/452351 (https://phabricator.wikimedia.org/T199720) [12:34:53] (03CR) 10Ema: [C: 032] ATS: move verify_config to ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/452351 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:35:46] (03CR) 10jenkins-bot: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452347 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [12:39:06] (03PS1) 10Legoktm: Add support for php7.2 image/backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452353 (https://phabricator.wikimedia.org/T188318) [12:39:32] (03PS2) 10Legoktm: Add support for php7.2 image/backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452353 (https://phabricator.wikimedia.org/T188318) [12:40:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:49:11] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:53:52] !log upgrading mw1261 servers to wikidiff 1.7.2 
(T199801) [12:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:58] T199801: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 [12:54:14] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10BBlack) >>! In T164609#4497768, @ema wrote: >>>! In T164609#4483549, @Joe wrote: >> Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit;... [12:54:49] 10Operations, 10Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (10Gehel) >>! In T201772#4497359, @TheDJ wrote: > Why do we cache 24 hours ? That seems like a lot for clients to cache. 1 hour would seem more than sufficient shouldn't it ? varnish co... [12:55:32] 10Operations, 10Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (10Gehel) 05Open>03Resolved a:03Gehel Resolving as this specific issue is now OK. [12:56:27] !log reimaging of elasticsearch / cirrus / codfw cluster (RAID0 / Stretch) completed - T193649 / T198391 [12:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:35] T198391: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 [12:56:35] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 [12:56:45] !log start reimaging of elasticsearch / cirrus / eqiad cluster (RAID0 / Stretch) - T193649 / T198391 [12:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:57:38] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1017.eqiad.wmnet'] ``` The log... 
[13:01:34] (03PS5) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [13:01:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:07:02] 10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169 (10Gehel) [13:09:10] 10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169 (10Gehel) [13:14:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:17:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:19:35] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1017.eqiad.wmnet'] ``` and were **ALL** successful. [13:23:04] 10Operations, 10ops-codfw: restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 (10Eevans) > the system is back up but I don't intend to put it back into rotation until hardware is inspected. I guess out of rotation in this context probably means RESTBase(?); It appears Cassan... [13:23:27] 10Operations, 10ops-codfw, 10Services (watching): restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 (10Eevans) [13:25:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1043.eqiad.wmnet', 'elastic1051... [13:25:25] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1043.eqiad.wmnet', 'elastic1052.eqiad.wmnet', 'elastic1051.eqiad.wmnet'] ``` Of... [13:26:06] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:26:33] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1043.eqiad.wmnet', 'elastic1051... 
[13:28:06] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:29:32] (03PS4) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) [13:32:08] PROBLEM - Check systemd state on elastic1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:08] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:09] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:09] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:09] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:09] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:19] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:19] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:24] gehel: ^ [13:32:28] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:28] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:39] PROBLEM - Check systemd state on elastic1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:42] damn, not looking good [13:32:49] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:32:50] gehel: mjolnir-kafka-bulk-daemon.service [13:33:08] PROBLEM - Check systemd state on elastic1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:09] PROBLEM - Check systemd state on elastic1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:09] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:18] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:19] ok, so minor, but why did it fail on all those hosts while I was reimaging others, checking [13:33:19] PROBLEM - Check systemd state on elastic1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:28] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:29] PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:38] PROBLEM - Check systemd state on elastic1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:33:39] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:39] PROBLEM - Check systemd state on elastic1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:49] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:49] PROBLEM - Check systemd state on elastic1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:59] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:59] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:59] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:02] ReadTimeoutError apparently [13:34:09] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:22] to localhost? that seems weird [13:34:28] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:28] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:32] yes [13:34:34] 9200 [13:34:38] RECOVERY - Check systemd state on elastic1031 is OK: OK - running: The system is fully operational [13:34:39] RECOVERY - Check systemd state on elastic1020 is OK: OK - running: The system is fully operational [13:35:00] slowdown of elasticsearch while shards were moving around? checking [13:35:19] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational [13:36:08] yeah, there was a short rise in response times, probably related [13:36:09] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [13:36:28] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [13:36:29] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational [13:36:48] RECOVERY - Check systemd state on elastic1028 is OK: OK - running: The system is fully operational [13:37:31] looks like the cluster is much more sensitive to a few hosts down than it was during previous cluster restarts. Might be unrelated, but the timing correlates too well. [13:38:31] there was a rise in response times during ~3 minutes, the nodes are not finished with reimage yet, strange [13:38:39] RECOVERY - Check systemd state on elastic1029 is OK: OK - running: The system is fully operational [13:38:59] all those errors should already be recovering [13:38:59] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational [13:39:59] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational [13:41:07] gehel: could be that after the reimage the host joins the cluster before it should during the first puppet run?
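Per the exchange above, the failing daemon was timing out reading from the local Elasticsearch node on port 9200 while shards were relocating. A sketch of that failure mode, not the actual mjolnir code (which presumably uses the elasticsearch client and urllib3's ReadTimeoutError); the endpoint comes from the discussion, the timeout value is hypothetical:

```python
import sys
import requests

ES = "http://localhost:9200"   # the daemon talks to the node it runs on
READ_TIMEOUT = 5               # hypothetical; shorter than the latency spike seen here

def local_node_responsive():
    try:
        # (connect timeout, read timeout): the read side is what tripped here;
        # the node was up but slow to answer while shards were moving around.
        r = requests.get(ES + "/_cluster/health", timeout=(2, READ_TIMEOUT))
        r.raise_for_status()
        return r.json().get("status") in ("green", "yellow")
    except requests.exceptions.ReadTimeout:
        return False

if __name__ == "__main__":
    if not local_node_responsive():
        # An unhandled timeout (or a non-zero exit like this) is what leaves the
        # unit in the failed state that the systemd checks then report.
        sys.exit(1)
```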
[13:41:26] nope, they just started the reimage process, they are far from done [13:41:29] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational [13:42:04] i wonder, do those alert on first fail or does it at least restart it once before declaring a loss? [13:42:08] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational [13:42:21] it looks more like they left the cluster in a hurry (power reset during reimage) and the master was swamped with cluster state changes [13:42:29] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [13:42:38] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational [13:42:38] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational [13:42:59] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [13:43:43] gehel: I can add the --stop option to the reimage script if needed [13:44:19] RECOVERY - Check systemd state on elastic1023 is OK: OK - running: The system is fully operational [13:44:29] volans: at some point, that would be nice, but the cluster should be more stable than that to losing a few nodes, still trying to understand why it had as much impact as it had [13:44:45] ack [13:45:29] we had a bunch of relocations going on at the same time, probably did not help [13:45:51] indeed, the weird thing here is that many machines all timed out. I could see one or two somehow, but to have the whole cluster stall out (also seen in latency graphs) i dunno.. [13:45:52] PROBLEM - Disk space on elastic1046 is CRITICAL: DISK CRITICAL - free space: /srv 73367 MB (10% inode=99%) [13:46:12] RECOVERY - Check systemd state on elastic1030 is OK: OK - running: The system is fully operational [13:46:23] (03PS9) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [13:46:25] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: use keytone_host instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) [13:46:29] (03CR) 10Andrew Bogott: [C: 031] "Thanks for doing this! As long as the puppet compiler is happy, I'm happy." [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [13:47:00] mostly unrelated, but we publish source info (like line number) in logstash. That's probably expensive! [13:47:33] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [13:48:48] ebernhardson: some sort of cluster consensus/election that locked the metadata?
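The working theory in the exchange above is that the power-reset hosts dropped out abruptly and the elected master was swamped with cluster state updates (node-left events, shard relocations). That backlog is visible from any node through the standard cluster APIs; a small sketch, assuming access to a node on port 9200:

```python
import requests

ES = "http://localhost:9200"   # any node forwards these requests to the elected master

health = requests.get(ES + "/_cluster/health", timeout=10).json()
print("status:", health["status"])
print("relocating shards:", health["relocating_shards"])
print("pending cluster-state tasks:", health["number_of_pending_tasks"])

# Per-task view of what the master is working through
# (node-left/node-joined events, shard-started updates, ...).
print(requests.get(ES + "/_cat/pending_tasks?v", timeout=10).text)
```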
[13:49:52] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [13:51:13] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational [13:51:32] RECOVERY - Check systemd state on elastic1022 is OK: OK - running: The system is fully operational [13:51:52] PROBLEM - Disk space on elastic1046 is CRITICAL: DISK CRITICAL - free space: /srv 72921 MB (10% inode=99%) [13:52:23] RECOVERY - Check systemd state on elastic1019 is OK: OK - running: The system is fully operational [13:53:42] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1043.eqiad.wmnet', 'elastic1051.eqiad.wmnet', 'elastic1052.eqiad.wmnet'] ``` an... [13:55:50] volans: seems plausible, but i didn't realize elasticsearch even had global locks. Usually nodes are pretty happy to go on serving search requests even partitioned from the network (they are configured to allow that) [13:56:33] RECOVERY - Check systemd state on elastic1047 is OK: OK - running: The system is fully operational [13:56:33] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational [13:56:33] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational [13:56:53] RECOVERY - Disk space on elastic1046 is OK: DISK OK [13:57:05] (03PS1) 10WMDE-leszek: Wikidata: Use new item ID formatter for Q1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) [13:57:18] or maybe the master node got a hiccup? [13:57:22] (03CR) 10WMDE-leszek: [C: 04-1] Wikidata: Use new item ID formatter for Q1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: 10WMDE-leszek) [13:57:33] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational [13:57:37] volans: best guess so far [13:57:48] ack [13:57:52] RECOVERY - Check systemd state on elastic1024 is OK: OK - running: The system is fully operational [13:59:32] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [13:59:43] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [13:59:52] RECOVERY - Check systemd state on elastic1018 is OK: OK - running: The system is fully operational [14:00:20] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12059/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:00:33] (03PS5) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) [14:00:35] _joe_: love your final solution :D [14:00:39] RECOVERY - Check systemd state on elastic1025 is OK: OK - running: The system is fully operational [14:00:57] (03PS1) 10Andrew Bogott: Horizon: disable during keystone switchover [puppet] - 10https://gerrit.wikimedia.org/r/452367 (https://phabricator.wikimedia.org/T201504) [14:01:13] <_joe_> volans: yeah otoh all wikis moved to the ::vhost define are lacking math redirect until this one [14:01:17] <_joe_> they were all private wikis [14:01:23] <_joe_> but I need to add the parameter [14:01:26] ack 
[14:01:44] (03CR) 10Andrew Bogott: [C: 032] Horizon: disable during keystone switchover [puppet] - 10https://gerrit.wikimedia.org/r/452367 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [14:03:52] (03PS2) 10Ottomata: Add cron job to create and rotate EventLogging salts [puppet] - 10https://gerrit.wikimedia.org/r/451780 (https://phabricator.wikimedia.org/T199899) (owner: 10Mforns) [14:03:57] (03CR) 10Ottomata: [V: 032 C: 032] Add cron job to create and rotate EventLogging salts [puppet] - 10https://gerrit.wikimedia.org/r/451780 (https://phabricator.wikimedia.org/T199899) (owner: 10Mforns) [14:04:27] FYI we are doing some operations in cloudvps so some stuff may alert. I downtimed a lot of stuff, but still, some server may complain [14:05:15] <_joe_> ottomata: mean! [14:05:24] (03PS6) 10Giuseppe Lavagetto: wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) [14:05:32] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] wikimaniateam: expand all includes preparing for transition [puppet] - 10https://gerrit.wikimedia.org/r/452330 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:05:36] !log upgrading mw1262-1265 to wikidiff 1.7.2 (T199801) [14:05:43] _joe_: ehhh? [14:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:44] T199801: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 [14:05:58] <_joe_> ottomata: you V+2C+2 [14:06:03] oh ya after a rebase [14:06:08] <_joe_> while I was waiting patiently for jenkins [14:06:09] <_joe_> :D [14:06:11] haha [14:08:08] (03PS10) 10Andrew Bogott: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [14:09:48] !log stopping nodepool, downtiming horizon for T201504 [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] T201504: cloudvps: main/eqiad1 keystone merge - https://phabricator.wikimedia.org/T201504 [14:10:48] !log T201504 disable keystone in main and eqiad1 deployments, all has been downtimed in icinga [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:00] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [14:12:23] (03CR) 10Andrew Bogott: [C: 032] cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [14:12:39] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [14:12:42] (03PS3) 10Andrew Bogott: cloudvps: use keytone_host instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [14:13:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission uranium/WMF3128 - https://phabricator.wikimedia.org/T191348 (10Cmjohnson) [14:13:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Cmjohnson) [14:13:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: 
decommission uranium/WMF3128 - https://phabricator.wikimedia.org/T191348 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [14:14:31] (03CR) 10Andrew Bogott: [C: 032] cloudvps: use keytone_host instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452345 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [14:16:21] (03PS1) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [14:16:40] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [14:18:08] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: add ServerAlias support [puppet] - 10https://gerrit.wikimedia.org/r/451256 [14:18:10] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: add the domain_suffix variable [puppet] - 10https://gerrit.wikimedia.org/r/452374 (https://phabricator.wikimedia.org/T196968) [14:19:49] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [14:20:48] (03PS1) 10Cmjohnson: Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452377 (https://phabricator.wikimedia.org/T197630) [14:21:03] (03CR) 10jerkins-bot: [V: 04-1] Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452377 (https://phabricator.wikimedia.org/T197630) (owner: 10Cmjohnson) [14:21:09] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [14:21:30] (03PS11) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [14:21:32] (03PS1) 10Volans: dry-run: add set_dry_run function [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) [14:21:34] (03PS1) 10Volans: Log: use local variable for dry_run [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) [14:21:39] (03CR) 10Volans: "Refactor done, I need to complete the tests but sending it out to allow a first review." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:21:49] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [14:22:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12060/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452374 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:25:32] (03PS2) 10Cmjohnson: Adding mgmt dns for several new servers [dns] - 10https://gerrit.wikimedia.org/r/451680 (https://phabricator.wikimedia.org/T201343) [14:25:45] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for several new servers [dns] - 10https://gerrit.wikimedia.org/r/451680 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson) [14:25:52] !log installing dpkg updates from stretch point release [14:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:25] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930 (10jcrespo) [14:28:46] 10Operations, 10Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (10MusikAnimal) Thank you, all! [14:29:45] (03PS2) 10Cmjohnson: Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452377 (https://phabricator.wikimedia.org/T197630) [14:29:58] (03CR) 10jerkins-bot: [V: 04-1] Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452377 (https://phabricator.wikimedia.org/T197630) (owner: 10Cmjohnson) [14:30:26] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930 (10jcrespo) 05Open>03Resolved a:03jcrespo Resolving this meta-ticket. With the introduction of ROW-based replication before filterin, no recurri... [14:31:08] (03Abandoned) 10Cmjohnson: Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452377 (https://phabricator.wikimedia.org/T197630) (owner: 10Cmjohnson) [14:33:41] (03PS2) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [14:33:43] (03PS1) 10Andrew Bogott: keystone: pass the actual keystone host to openstack::util::envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452381 (https://phabricator.wikimedia.org/T201504) [14:34:01] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12061/mw1261.eqiad.wmnet/ only whitespace diffs." [puppet] - 10https://gerrit.wikimedia.org/r/451256 (owner: 10Giuseppe Lavagetto) [14:34:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Great, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/452381 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [14:34:44] (03PS2) 10Arturo Borrero Gonzalez: keystone: pass the actual keystone host to openstack::util::envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452381 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [14:34:48] (03Abandoned) 10Cmjohnson: Adding mgmt dns for several new servers [dns] - 10https://gerrit.wikimedia.org/r/451680 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson) [14:38:18] (03PS5) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [14:40:30] (03PS3) 10Ottomata: Add cron job to create and rotate EventLogging salts [puppet] - 10https://gerrit.wikimedia.org/r/451780 (https://phabricator.wikimedia.org/T199899) (owner: 10Mforns) [14:40:36] (03CR) 10Ottomata: [V: 032 C: 032] Add cron job to create and rotate EventLogging salts [puppet] - 10https://gerrit.wikimedia.org/r/451780 (https://phabricator.wikimedia.org/T199899) (owner: 10Mforns) [14:41:08] (03PS1) 10Cmjohnson: Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452383 (https://phabricator.wikimedia.org/T197630) [14:41:57] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom host samarium [dns] - 10https://gerrit.wikimedia.org/r/452383 (https://phabricator.wikimedia.org/T197630) (owner: 10Cmjohnson) [14:43:17] (03CR) 10Bstorm: "> Patch Set 9: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [14:46:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10Cmjohnson) Removed the dns entries. Still needs to be removed from the switch. @ayounsi can you remove samarium from the fundraising switch. 
[14:47:39] !log disabling puppet and shutting down db1052 for final decommissioning [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:42] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 356 bytes in 60.004 second response time [14:51:09] !log installing ghostscript security updates [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] (03PS1) 10Cmjohnson: Removing db1052 from site.pp final decommission [puppet] - 10https://gerrit.wikimedia.org/r/452385 (https://phabricator.wikimedia.org/T199861) [14:52:39] (03PS2) 10Cmjohnson: Removing db1052 from site.pp final decommission [puppet] - 10https://gerrit.wikimedia.org/r/452385 (https://phabricator.wikimedia.org/T199861) [14:53:07] (03CR) 10Jcrespo: [C: 031] Removing db1052 from site.pp final decommission [puppet] - 10https://gerrit.wikimedia.org/r/452385 (https://phabricator.wikimedia.org/T199861) (owner: 10Cmjohnson) [14:54:14] (03CR) 10Cmjohnson: [C: 032] Removing db1052 from site.pp final decommission [puppet] - 10https://gerrit.wikimedia.org/r/452385 (https://phabricator.wikimedia.org/T199861) (owner: 10Cmjohnson) [14:55:47] (03CR) 10BBlack: [C: 031] varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [14:56:51] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: fix reference to undef variable in private-https [puppet] - 10https://gerrit.wikimedia.org/r/452388 [14:57:16] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: fix reference to undef variable in private-https [puppet] - 10https://gerrit.wikimedia.org/r/452388 (owner: 10Giuseppe Lavagetto) [14:58:40] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: fix reference to undef variable in private-https [puppet] - 10https://gerrit.wikimedia.org/r/452388 [14:59:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki::web: fix reference to undef variable in private-https [puppet] - 10https://gerrit.wikimedia.org/r/452388 (owner: 10Giuseppe Lavagetto) [15:00:21] !log otto@deploy1001 Started deploy [analytics/refinery@9006a4e]: refinery changes for T198908 [15:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [15:02:29] (03PS10) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [15:03:17] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [15:04:03] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:04:23] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:04:32] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:04:42] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:06:41] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I'll add this to the existing account consistency check. [15:07:23] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:07:32] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:08:46] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: main: restore envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) [15:09:20] (03PS2) 10Arturo Borrero Gonzalez: cloud vps: main: restore envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) [15:09:25] (03CR) 10jerkins-bot: [V: 04-1] cloud vps: main: restore envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [15:09:36] (03PS11) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [15:09:54] (03CR) 10jerkins-bot: [V: 04-1] cloud vps: main: restore envscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [15:10:03] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 50930 MB (10% inode=99%) [15:10:38] !log otto@deploy1001 Finished deploy [analytics/refinery@9006a4e]: refinery changes for T198908 (duration: 10m 30s) [15:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:45] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [15:11:32] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:13:10] andrewbogott, arturo: traffic-puppetmaster is currently broken with "Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::openstack::main::keystone_host in any Hiera data file and no default supplied [...]" [15:13:12] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 51777 MB (10% inode=99%) [15:13:21] (03PS3) 10Arturo Borrero Gonzalez: cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) [15:13:24] andrewbogott, arturo: is that a known issue? [15:13:32] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:13:33] unexpected [15:13:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:14:02] (03CR) 10jerkins-bot: [V: 04-1] cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [15:14:13] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:14:14] ema: but I see that [15:14:16] https://www.irccloud.com/pastebin/WaMNAGNl/ [15:14:32] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:14:33] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:14:58] 10Operations, 10ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T201761 (10Papaul) a:05Papaul>03jcrespo Disk replaced [15:16:13] (03PS3) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [15:16:15] (03PS1) 10Andrew Bogott: Pass keystone_host to profile::openstack::main::nova::common [puppet] - 10https://gerrit.wikimedia.org/r/452393 (https://phabricator.wikimedia.org/T201504) [15:16:43] (03PS1) 10Cmjohnson: Removing dns entries decom host db1052 [dns] - 10https://gerrit.wikimedia.org/r/452394 (https://phabricator.wikimedia.org/T199861) [15:17:13] (03PS4) 10Arturo Borrero Gonzalez: cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) [15:17:15] (03CR) 10Rush: [C: 031] Pass keystone_host to profile::openstack::main::nova::common [puppet] - 10https://gerrit.wikimedia.org/r/452393 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [15:17:22] (03CR) 10Cmjohnson: [C: 032] Removing dns entries decom host db1052 [dns] - 10https://gerrit.wikimedia.org/r/452394 (https://phabricator.wikimedia.org/T199861) (owner: 10Cmjohnson) [15:17:25] (03CR) 10Andrew Bogott: [C: 032] Pass keystone_host to profile::openstack::main::nova::common [puppet] - 10https://gerrit.wikimedia.org/r/452393 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [15:17:32] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Pass keystone_host to profile::openstack::main::nova::common [puppet] - 10https://gerrit.wikimedia.org/r/452393 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [15:18:40] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1052 - https://phabricator.wikimedia.org/T199861 (10Cmjohnson) [15:18:49] (03PS5) 10Arturo Borrero Gonzalez: cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) [15:19:45] (03CR) 10Andrew Bogott: [C: 031] cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [15:20:02] 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Joe) p:05Triage>03Normal [15:20:02] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: main: restore envscripts and adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/452389 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [15:25:32] RECOVERY - Device not healthy -SMART- on db2039 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2039&var-datasource=codfw%2520prometheus%252Fops [15:27:55] (03PS12) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [15:30:46] (03CR) 10Bstorm: "This makes the service names less stupid, but preserves the information we want in a comment in the unit file. Also, I'll be damned if I'" [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [15:31:20] (03PS1) 10Cmjohnson: Removing puppet entries from decom host lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/452397 (https://phabricator.wikimedia.org/T191360) [15:31:31] 10Operations, 10ops-eqiad, 10Patch-For-Review: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10ayounsi) 05Open>03Resolved a:03ayounsi Removed! [15:31:52] (03CR) 10jerkins-bot: [V: 04-1] Removing puppet entries from decom host lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/452397 (https://phabricator.wikimedia.org/T191360) (owner: 10Cmjohnson) [15:33:31] !log otto@deploy1001 Started deploy [analytics/refinery@a051125]: fix for T198908 [15:33:34] !log shutting down lvs2009 to disable on board NICs [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:41] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [15:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:10] (03Abandoned) 10Cmjohnson: Removing puppet entries from decom host lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/452397 (https://phabricator.wikimedia.org/T191360) (owner: 10Cmjohnson) [15:34:14] (03PS13) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [15:34:52] (03PS4) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [15:34:54] (03PS1) 10Andrew Bogott: nova-fullstack: specify region from hiera [puppet] - 10https://gerrit.wikimedia.org/r/452400 [15:34:56] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [15:35:18] (03PS14) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [15:35:23] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:36] <_joe_> uh? [15:35:42] wut? 
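Back to the puppetmaster breakage ema reported above ("Could not find data item profile::openstack::main::keystone_host in any Hiera data file and no default supplied"): that is Hiera's standard behaviour when a class parameter is bound to a key that no hierarchy layer defines and the parameter has no default, which is what the later "Pass keystone_host ..." and "define profile::openstack::main::keystone_host for cloud VMs" patches address. A toy illustration of the lookup semantics in Python; the key name comes from the log, while the layers and the value are hypothetical stand-ins, not real hieradata:

```python
KEY = "profile::openstack::main::keystone_host"

_MISSING = object()

def hiera_lookup(key, layers, default=_MISSING):
    """Return the first value for `key` found in the ordered layers."""
    for layer in layers:
        if key in layer:
            return layer[key]
    if default is not _MISSING:
        return default
    # No layer defines the key and the class supplies no default:
    # catalog compilation aborts with exactly this kind of error.
    raise LookupError("Could not find data item %s in any Hiera data file "
                      "and no default supplied" % key)

before_fix = [{}, {}]                              # key defined nowhere
after_fix = [{}, {KEY: "keystone.example.wmnet"}]  # hypothetical placeholder value

try:
    hiera_lookup(KEY, before_fix)
except LookupError as err:
    print("compile fails:", err)

print("compile succeeds:", hiera_lookup(KEY, after_fix))
```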
[15:36:02] RECOVERY - Disk space on elastic1023 is OK: DISK OK [15:36:02] (03CR) 10Arturo Borrero Gonzalez: [C: 031] nova-fullstack: specify region from hiera [puppet] - 10https://gerrit.wikimedia.org/r/452400 (owner: 10Andrew Bogott) [15:36:09] 2009 isn't in production, in any case [15:36:59] 2001-6 are existing, 2007-10 are new ones still being provisioned afaik [15:37:00] (03PS1) 10Cmjohnson: Removing puppet entries decom host lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/452401 (https://phabricator.wikimedia.org/T191360) [15:37:21] yeah, not in site.pp [15:39:13] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:13] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:18] (03CR) 10Cmjohnson: [C: 032] Removing puppet entries decom host lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/452401 (https://phabricator.wikimedia.org/T191360) (owner: 10Cmjohnson) [15:39:33] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:33] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:33] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:52] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:40:23] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) on boad NICs disable [15:41:02] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [15:41:26] (03PS1) 10Cmjohnson: Removing dns for decom host lawrencium [dns] - 10https://gerrit.wikimedia.org/r/452403 (https://phabricator.wikimedia.org/T191360) [15:41:37] !log otto@deploy1001 Finished deploy [analytics/refinery@a051125]: fix for T198908 (duration: 08m 06s) [15:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:43] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [15:42:05] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/12063/mw1261.eqiad.wmnet/ is the new diff." [puppet] - 10https://gerrit.wikimedia.org/r/451255 (owner: 10Giuseppe Lavagetto) [15:43:03] (03CR) 10Cmjohnson: [C: 032] Removing dns for decom host lawrencium [dns] - 10https://gerrit.wikimedia.org/r/452403 (https://phabricator.wikimedia.org/T191360) (owner: 10Cmjohnson) [15:44:43] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:44:43] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:44:52] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:44:57] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler02/12065/" [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [15:45:02] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:45:08] (03PS2) 10Andrew Bogott: nova-fullstack: specify region from hiera [puppet] - 10https://gerrit.wikimedia.org/r/452400 [15:45:10] (03PS5) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [15:45:12] (03PS1) 10Andrew Bogott: puppet validatelabsfqdn: hardcode region to 'eqiad' [puppet] - 10https://gerrit.wikimedia.org/r/452404 (https://phabricator.wikimedia.org/T201423) [15:45:21] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360 (10Cmjohnson) [15:45:32] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:45:32] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:46:46] (03CR) 10Andrew Bogott: [C: 032] puppet validatelabsfqdn: hardcode region to 'eqiad' [puppet] - 10https://gerrit.wikimedia.org/r/452404 (https://phabricator.wikimedia.org/T201423) (owner: 10Andrew Bogott) [15:47:15] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/12066/ Compiler is happy" [puppet] - 10https://gerrit.wikimedia.org/r/452400 (owner: 10Andrew Bogott) [15:49:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team: Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Cmjohnson) @chasemp @Bstorm What is the status of these? Can the decom process continue? Thanks [15:51:51] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission niobium - https://phabricator.wikimedia.org/T181763 (10Cmjohnson) [15:52:04] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission niobium - https://phabricator.wikimedia.org/T181763 (10Cmjohnson) 05Open>03Resolved [15:52:35] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360 (10Cmjohnson) [15:52:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Cmjohnson) [15:52:45] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [15:53:24] (03PS6) 10Andrew Bogott: Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 [15:53:26] (03PS1) 10Andrew Bogott: define profile::openstack::main::keystone_host for cloud VMs. [puppet] - 10https://gerrit.wikimedia.org/r/452406 [15:55:14] (03CR) 10Andrew Bogott: [C: 032] define profile::openstack::main::keystone_host for cloud VMs. [puppet] - 10https://gerrit.wikimedia.org/r/452406 (owner: 10Andrew Bogott) [15:56:37] (03PS1) 10Brian Wolff: Adjust CSP settings. 
For this stage allow inline but restrict src [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452407 (https://phabricator.wikimedia.org/T135963) [15:58:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom spare server osmium/wmf4546 - https://phabricator.wikimedia.org/T191364 (10Cmjohnson) [15:58:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Cmjohnson) [15:59:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom spare server osmium/wmf4546 - https://phabricator.wikimedia.org/T191364 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [15:59:41] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) Disk wipe in progress [16:00:01] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) [16:05:03] (03CR) 10Andrew Bogott: [C: 032] Revert "Horizon: disable during keystone switchover" [puppet] - 10https://gerrit.wikimedia.org/r/452373 (owner: 10Andrew Bogott) [16:05:17] 10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169 (10RobH) a:03RobH [16:18:24] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert loginwiki, chapterwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [16:22:12] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:23:49] (03PS1) 10Andrew Bogott: nova_fullstack test: pass in region_name to nova client [puppet] - 10https://gerrit.wikimedia.org/r/452411 [16:24:58] (03CR) 10Andrew Bogott: [C: 032] nova_fullstack test: pass in region_name to nova client [puppet] - 10https://gerrit.wikimedia.org/r/452411 (owner: 10Andrew Bogott) [16:28:40] (03PS4) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [16:29:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [16:32:22] (03PS5) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [16:33:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [16:35:13] 10Operations, 10Performance-Team, 10Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (10Imarlier) p:05Normal>03High a:03Imarlier [16:36:53] (03PS1) 10Andrew Bogott: keystone policy: allow any user to list available regions [puppet] - 10https://gerrit.wikimedia.org/r/452414 [16:37:58] (03CR) 10Andrew Bogott: [C: 032] keystone policy: allow any user to list available regions [puppet] - 10https://gerrit.wikimedia.org/r/452414 (owner: 10Andrew Bogott) [16:44:35] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10thcipriani) [16:46:18] (03PS1) 10Andrew Bogott: 
validatelabsfqdn.py: Make multi-region aware [puppet] - 10https://gerrit.wikimedia.org/r/452418 (https://phabricator.wikimedia.org/T201423) [16:48:57] (03PS2) 10Andrew Bogott: validatelabsfqdn.py: Make multi-region aware [puppet] - 10https://gerrit.wikimedia.org/r/452418 (https://phabricator.wikimedia.org/T201423) [16:50:08] (03CR) 10Andrew Bogott: [C: 032] validatelabsfqdn.py: Make multi-region aware [puppet] - 10https://gerrit.wikimedia.org/r/452418 (https://phabricator.wikimedia.org/T201423) (owner: 10Andrew Bogott) [16:51:35] 10Operations, 10Beta-Cluster-Infrastructure, 10Jenkins, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10thcipriani) [16:52:03] 10Operations, 10Puppet, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Someday): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10thcipriani) [16:52:43] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:57:33] 10Operations, 10Performance-Team, 10Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (10BBlack) The incident report needs some deeper updates (will work on that today), but it's almost certainly related to https://wikit... [16:58:07] gehel: for some reason my wikitech login is not working, so I can't update deploy schedule [16:58:28] gehel: could you do re-deploy of GUI and other stuff, as usual? [17:00:04] gehel: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T1700). [17:00:36] SMalyshev: we are looking into it [17:00:59] SMalyshev: on it! [17:01:20] arturo: ah, so it's a known issue? cool, will wait [17:01:37] !log gehel@deploy1001 Started deploy [wdqs/wdqs@a62b1ac]: new version of wdqs GUI (wdqs1009 only) [17:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:53] SMalyshev: we did some maintenance a couple of hours ago and wikitech could be affected [17:02:07] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a62b1ac]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 30s) [17:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:47] arturo: for me it waits for a while on login (way longer than usual), and then rejects. I checked same login, works for gerrit & phab, so something between ldap & wikitech I suspect [17:03:09] makes sense [17:03:20] gehel: also whitelist.txt needs updating (in case it doesn't happen automatically) [17:04:14] SMalyshev: deployed on wdqs1009, can you check the whitelist? Not sure how to do it. [17:04:31] SMalyshev: would that require a restart of blazegraph (which is done automatically) [17:04:47] o/ are there folks online now that can help me out with what is left needing to be done in order to get this into production? 
https://phabricator.wikimedia.org/T186748 [17:05:06] gehel: yes, updating whitelist requires blazegraph restart [17:05:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1047.eqiad.wmnet', 'elastic1049... [17:05:25] gehel: whitelist is fine on wdq9 [17:05:29] (03PS1) 10Andrew Bogott: wikitech.php: update keystone URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452422 [17:05:50] SMalyshev: tests are green, rolling out to other nodes [17:05:52] !log gehel@deploy1001 Started deploy [wdqs/wdqs@a62b1ac]: new version of wdqs GUI [17:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:14] 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) - approved in SRE meeting [17:07:21] jouncebot, next [17:07:22] In 0 hour(s) and 52 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T1800) [17:07:33] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:08:32] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:09:23] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:09:23] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:09:42] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [17:09:43] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [17:10:12] elasticsearch is slow again, 3 reiamged nodes will rejoin the cluster in a moment, which should help [17:10:23] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [17:10:32] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [17:10:37] I've had that feeling already. [17:10:48] (03CR) 10Thcipriani: [C: 032] wikitech.php: update keystone URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452422 (owner: 10Andrew Bogott) [17:11:12] (03PS2) 10Imarlier: dumps: Datahub has moved [puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) [17:12:10] SMalyshev: is wikitech better now? [17:12:12] (03Merged) 10jenkins-bot: wikitech.php: update keystone URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452422 (owner: 10Andrew Bogott) [17:12:57] (03CR) 10Imarlier: "Assigning to Ariel -- I don't have +2 on the puppet repo, hoping that you might be able to merge..." 
[puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [17:13:13] andrewbogott: staging your patch on deployment server, will star sync in one second [17:13:15] *start [17:13:55] SMalyshev: ok, actually, wait 5 minutes before answering that then :) [17:13:59] thanks thcipriani [17:14:28] (03PS1) 10Andrew Bogott: nfs-exportd: gather up IPs from all regions [puppet] - 10https://gerrit.wikimedia.org/r/452425 [17:14:32] (03PS15) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [17:15:09] !log thcipriani@deploy1001 Synchronized wmf-config/wikitech.php: [[gerrit:452422|wikitech.php: update keystone URLs]] (duration: 00m 53s) [17:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:14] andrewbogott: ^ live now [17:16:01] thcipriani: works for me! [17:16:02] RECOVERY - HP RAID on db2039 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [17:16:02] thanks again [17:16:09] yw :) [17:16:37] 10Operations, 10Performance-Team, 10Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (10Imarlier) @BBlack That was my assumption as well. I want to verify that the agents in AWS are, in fact, routing to codfw, but that... [17:16:41] 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) [17:16:46] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a62b1ac]: new version of wdqs GUI (duration: 10m 54s) [17:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:26] (03CR) 10ArielGlenn: [C: 032] dumps: Datahub has moved [puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [17:17:27] SMalyshev: deployment completed, tests are green [17:18:11] gehel: great, thanks! 
[17:18:14] 10Operations: onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201855 (10Dzahn) [17:20:02] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] Gehel seems related to reimage in progress, keeping an eye on it and will switch to codfw if required https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:21:52] (03PS2) 10Bstorm: nfs-exportd: gather up IPs from all regions [puppet] - 10https://gerrit.wikimedia.org/r/452425 (owner: 10Andrew Bogott) [17:23:03] (03CR) 10Bstorm: [C: 032] nfs-exportd: gather up IPs from all regions [puppet] - 10https://gerrit.wikimedia.org/r/452425 (owner: 10Andrew Bogott) [17:23:53] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:23:57] (03CR) 10jenkins-bot: wikitech.php: update keystone URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452422 (owner: 10Andrew Bogott) [17:24:59] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [17:25:19] /away [17:25:22] nope [17:27:00] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:30:30] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) [17:31:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:31:55] (03PS1) 10Bstorm: nfs-exportd: gather up IPs from all regions [puppet] - 10https://gerrit.wikimedia.org/r/452426 [17:32:26] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1047.eqiad.wmnet', 'elastic1050.eqiad.wmnet', 'elastic1049.eqiad.wmnet'] ``` an... 
[17:32:45] (03CR) 10Bstorm: [C: 032] nfs-exportd: gather up IPs from all regions [puppet] - 10https://gerrit.wikimedia.org/r/452426 (owner: 10Bstorm) [17:33:30] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) [17:33:53] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10jcrespo) So db1066 is the s2 eqiad master active, so any downtime there means the s2 wikis go read only: https://noc.wikimedia.org/db.php#tabs-2 [17:33:57] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10fgiunchedi) re: ms-be1040 it can be moved back to the old switch any time [17:34:06] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) [17:34:12] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:34:52] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [17:34:58] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) [17:35:15] (03CR) 10Zhuyifei1999: [C: 031] "Feel free to merge once the image is built. Also, after building the deb all the docker images will need also a rebuild to get the newer v" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452353 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [17:35:21] (03CR) 10jerkins-bot: [V: 04-1] labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [17:35:22] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:35:32] 10Operations, 10SRE-Access-Requests: Please add me to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10mepps) [17:36:35] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) [17:36:53] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:37:08] (03CR) 10jerkins-bot: [V: 04-1] labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [17:38:22] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [17:38:42] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:38:43] (03CR) 10Alex Monk: "are region names equal to region IDs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [17:39:10] (03PS3) 10Andrew Bogott: labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) [17:39:27] (03PS1) 10Bstorm: nfs-exportd: correcting typo [puppet] - 10https://gerrit.wikimedia.org/r/452428 [17:39:42] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [17:39:50] (03CR) 10Andrew Bogott: "As far as I can tell, region_name is just a stupid argument name. Regions don't seem to even have a 'name' in the underlying objects." [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [17:40:02] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [17:40:13] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452427 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [17:40:52] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [17:42:07] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) a:03Dzahn [17:42:26] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) a:03Dzahn [17:43:51] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10Dzahn) This request has been neither approved nor denied but set to "needs more discussion" in SRE meeting. [17:45:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:46:12] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:49:22] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:50:12] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:50:12] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [17:50:33] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 52065 MB (10% inode=99%) [17:51:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:51:23] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:52:03] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:52:05] Elastic is still slow. 
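The multi-region patches above (labs-ip-alias-dump.py, nfs-exportd "gather up IPs from all regions") and the exchange about region names vs. region IDs all come down to iterating over Keystone regions instead of assuming a single one. Below is a minimal sketch of that pattern, not the actual WMF scripts; the auth URL, credentials and project are placeholders, and it assumes the usual keystoneauth1/keystoneclient/novaclient stack. Keystone region objects expose an `id` (and `description`) rather than a `name`, which is why novaclient's `region_name` argument is, as noted above, just an argument name matched against the region id.

```python
# Minimal sketch (not the actual labs-ip-alias-dump.py / nfs-exportd code) of
# iterating all Keystone regions and listing servers per region.
# Auth URL, credentials and project below are placeholders.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client as keystone_client
from novaclient import client as nova_client

auth = v3.Password(
    auth_url='https://keystone.example.org:5000/v3',   # placeholder
    username='observer', password='secret',            # placeholder
    project_name='admin',                               # placeholder
    user_domain_name='Default', project_domain_name='Default',
)
sess = session.Session(auth=auth)
keystone = keystone_client.Client(session=sess)

# Region objects carry an 'id' (e.g. 'eqiad'), not a 'name'; novaclient's
# region_name parameter is matched against that id.
for region in keystone.regions.list():
    nova = nova_client.Client('2', session=sess, region_name=region.id)
    # all_tenants requires an admin-ish role in that region
    for server in nova.servers.list(search_opts={'all_tenants': True}):
        print(region.id, server.name)
```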
[17:52:42] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 52571 MB (10% inode=99%) [17:53:22] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:53:23] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [17:56:32] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:56:33] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:57:13] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:57:32] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:57:33] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:58:33] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:58:33] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:58:42] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [17:58:52] RECOVERY - Disk space on elastic1020 is OK: DISK OK [17:59:02] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:59:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:59:29] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py.trusty: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452435 (https://phabricator.wikimedia.org/T201504) [17:59:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [18:00:02] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py.trusty: make multi-region [puppet] - 10https://gerrit.wikimedia.org/r/452435 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T1800). 
[18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:13] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [18:00:43] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:01:34] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [18:01:43] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [18:02:53] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [18:02:53] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [18:03:12] (03PS1) 10BBlack: Update alexa image block to just 1500px URIs [puppet] - 10https://gerrit.wikimedia.org/r/452437 [18:04:06] is all the above 5xx noise related to recommendation_api? [18:04:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:04:44] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:05:24] (03PS13) 10Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) [18:05:29] Elasticsearch is also not completely back it seems [18:05:42] (03CR) 10Krinkle: "Hm.. Gerrit said "Merge conflict" but seems to rebase fine? Not sure why it failed to land." 
[puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [18:05:44] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [18:05:53] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:05:56] (03PS4) 10Krinkle: webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) [18:06:03] (03PS1) 10Gehel: [cirrus] switch search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452438 [18:06:17] \o/ [18:06:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:06:43] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:07:03] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:07:03] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:07:57] (03CR) 10EBernhardson: [C: 031] [cirrus] switch search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452438 (owner: 10Gehel) [18:08:24] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:08:54] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [18:08:54] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:09:44] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:09:54] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:10:43] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:11:44] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [18:11:54] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:12:47] does someone have a one-line summary of what's going on? We've got quite a spam of alerts above (5xx's, recommendation_api, MW fatals?) over at least the past hour or so?
[18:13:03] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [18:13:04] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [18:13:04] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [18:13:04] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [18:13:29] (03PS1) 10Ottomata: Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) [18:13:46] I'll even take a 3-liner! :) [18:13:54] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:14:07] (03PS11) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [18:14:17] (03PS5) 10Krinkle: webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) [18:14:37] bblack: no idea no [18:14:54] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:15:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:16:26] (03CR) 10jerkins-bot: [V: 04-1] Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:17:03] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [18:17:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:17:23] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.030 second response time [18:17:37] (03PS2) 10Ottomata: Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) [18:17:56] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) I was able to find @Lea_WMDE's username by searching LDAP for the email address like so: `[mwmaint1001:~] $ ldapsearch -x mail=lea.voget@wikimedia.de` uid: lea-wm... 
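The `ldapsearch -x mail=…` one-liner above is how the uid was found from an email address. For completeness, a hypothetical Python equivalent using the ldap3 library; the LDAP host and base DN below are assumptions, not taken from the log.

```python
# Hypothetical Python equivalent of the ldapsearch lookup above, using the
# ldap3 library. The LDAP host and base DN are assumed, not from the log.
from ldap3 import Server, Connection, ALL

server = Server('ldap-labs.eqiad.wikimedia.org', get_info=ALL)  # assumed host
conn = Connection(server, auto_bind=True)  # anonymous simple bind, like `ldapsearch -x`

conn.search(
    search_base='dc=wikimedia,dc=org',            # assumed base DN
    search_filter='(mail=lea.voget@wikimedia.de)',
    attributes=['uid', 'cn'],
)
for entry in conn.entries:
    print(entry.uid, entry.cn)
```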
[18:18:03] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [18:18:03] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [18:18:13] (03CR) 10jerkins-bot: [V: 04-1] Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:20:04] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:20:29] (03CR) 10BBlack: [C: 032] Update alexa image block to just 1500px URIs [puppet] - 10https://gerrit.wikimedia.org/r/452437 (owner: 10BBlack) [18:20:43] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) 05stalled>03Resolved a:05Lea_WMDE>03Dzahn done. Lea has been added to "wmde" as well. [18:21:24] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:22:19] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) grafana-admin access - https://phabricator.wikimedia.org/T199966 (10Dzahn) Lea has been added to the "wmde" LDAP group in the linked subtask. This should also resolve this. The same comments on T199965#4477710 also a... [18:22:30] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) [18:22:34] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) grafana-admin access - https://phabricator.wikimedia.org/T199966 (10Dzahn) 05Open>03Resolved a:03Dzahn [18:23:15] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.196 second response time [18:24:04] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:24:25] (03PS3) 10Ottomata: Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) [18:25:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:25:30] (03CR) 10jerkins-bot: [V: 04-1] Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:26:52] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10Patch-For-Review, 10User-Addshore: Give Bmueller grafana-admin access - https://phabricator.wikimedia.org/T199965 (10Addshore) >>! In T199965#4477710, @herron wrote: > The requested Grafana access should now be working. Although as Dzahn explained earl... 
[18:27:13] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:27:53] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) grafana-admin access - https://phabricator.wikimedia.org/T199966 (10Addshore) It doesn't look like lea has been added to the grafana-admin ldap group or the nda ldap group, so I don't believe she will currently be able... [18:28:00] mutante: ^^ [18:28:33] (03CR) 10EBernhardson: [C: 04-1] "not needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452438 (owner: 10Gehel) [18:29:24] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [18:29:53] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:30:08] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) [18:30:15] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10Patch-For-Review, 10User-Addshore: Give Bmueller grafana-admin access - https://phabricator.wikimedia.org/T199965 (10Dzahn) 05Resolved>03Open I wasn't aware of any group called "grafana-admin". Apparently there is one though (with just 4 users in it??) [18:30:45] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) [18:30:48] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) grafana-admin access - https://phabricator.wikimedia.org/T199966 (10Dzahn) 05Resolved>03Open I wasn't aware of any group called "grafana-admin". Apparently there is one though (with just 4 users in it??) [18:31:14] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10Patch-For-Review, 10User-Addshore: Give Bmueller "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199965 (10Addshore) [18:31:21] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199966 (10Addshore) [18:31:39] mutante: yes, so everyone else with access is in the nda group [18:31:48] (03PS4) 10Ottomata: Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) [18:32:45] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199966 (10Addshore) Indeed, as there is overlap with the NDA group (which has many more users) As far as I understand it grafana-admin should be u... [18:33:14] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:34:52] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199966 (10Dzahn) Well, if access to grafana requires an NDA and signing an NDA means being added to the group called "nda" and that group gives acc... 
[18:34:55] (03CR) 10Ottomata: [C: 032] "No op https://puppet-compiler.wmflabs.org/compiler02/12069/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:35:02] (03PS5) 10Ottomata: Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) [18:35:04] (03CR) 10Ottomata: [V: 032 C: 032] Add parameters for camus::job to pass to CamusPartitionChecker [puppet] - 10https://gerrit.wikimedia.org/r/452442 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:35:14] (03PS1) 10Andrew Bogott: nfs-exportd: fix our call to region.list() [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) [18:35:54] !log otto@deploy1001 Started deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908 [18:35:54] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199966 (10Addshore) >>! In T199966#4499424, @Dzahn wrote: > Well, if access to grafana requires an NDA and signing an NDA means being added to the... [18:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:02] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [18:36:21] (03PS1) 10Krinkle: webperf: Switch webperf::site to use arclamp from webperf-2 [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) [18:36:23] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) [18:36:23] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:36:25] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10User-Addshore: Give Lea Voget (WMDE) "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199966 (10Dzahn) 05Open>03Resolved added to "nda" -> https://tools.wmflabs.org/ldap/user/lea-wmde [18:36:28] (03PS2) 10Andrew Bogott: nfs-exportd: fix our call to region.list() [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) [18:38:25] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10Dzahn) [18:38:28] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10Patch-For-Review, 10User-Addshore: Give Bmueller "grafana-admin" LDAP group access - https://phabricator.wikimedia.org/T199965 (10Dzahn) 05Open>03Resolved Since Birgit is already in the "nda" group this should also give her access to grafana. [18:38:32] thanks mutante :D [18:40:31] (03CR) 10Bstorm: nfs-exportd: fix our call to region.list() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [18:40:39] addshore: no problem. i still need to add her to the puppet manifests. checking [18:41:24] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:42:24] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:42:39] (03PS19) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [18:42:46] (03PS3) 10Andrew Bogott: nfs-exportd: fix our call to region.list() [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) [18:43:01] (03PS1) 10Dzahn: admins: add Lea Voget to ldap admins (wmde,nde) [puppet] - 10https://gerrit.wikimedia.org/r/452450 (https://phabricator.wikimedia.org/T199966) [18:44:03] (03PS1) 10Ottomata: Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) [18:44:05] (03CR) 10Dzahn: [C: 032] admins: add Lea Voget to ldap admins (wmde,nde) [puppet] - 10https://gerrit.wikimedia.org/r/452450 (https://phabricator.wikimedia.org/T199966) (owner: 10Dzahn) [18:44:10] (03CR) 10Bstorm: [C: 032] nfs-exportd: fix our call to region.list() [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [18:44:21] (03PS4) 10Bstorm: nfs-exportd: fix our call to region.list() [puppet] - 10https://gerrit.wikimedia.org/r/452448 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [18:44:44] (03CR) 10jerkins-bot: [V: 04-1] Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [18:45:30] (03PS20) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [18:45:41] (03PS8) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) [18:45:43] (03CR) 10Krinkle: "Has the expected impact in puppet-compiler diff:" [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [18:45:53] (03PS7) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 (https://phabricator.wikimedia.org/T191183) [18:46:54] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:46:54] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:03] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:03] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:04] !log otto@deploy1001 Finished deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908 (duration: 11m 10s) [18:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:11] T198908: Alarms on throughput on camus imported data - https://phabricator.wikimedia.org/T198908 [18:47:23] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:34] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 
operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:49:53] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:50:13] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:50:13] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:50:14] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:50:14] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:50:34] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:51:11] (03PS21) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [18:51:33] (03PS2) 10Ottomata: Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) [18:52:00] (03PS3) 10Ottomata: Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) [18:52:30] (03PS1) 10Bstorm: labstore: nfs-exportd typo fix [puppet] - 10https://gerrit.wikimedia.org/r/452454 [18:53:09] 10Operations, 10Security-Team: deploy drupal as a GRC CMS for risk and compliance management - https://phabricator.wikimedia.org/T201860 (10chasemp) p:05Triage>03Normal [18:53:31] (03CR) 10Bstorm: [C: 032] labstore: nfs-exportd typo fix [puppet] - 10https://gerrit.wikimedia.org/r/452454 (owner: 10Bstorm) [18:55:44] (03PS4) 10Ottomata: Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) [18:56:01] Just got this on puppet-merge: [18:56:01] `2018-08-13 18:54:09 [INFO] conftool::yaml_log_error: Error parsing yaml file /etc/conftool/etcdrc: [Errno 2] No such file or directory: '/etc/conftool/etcdrc'` [18:56:15] What happened there? [18:57:29] _joe_ or volans: any thoughts on that? [18:58:10] <_joe_> bstorm_: that's ok, I've been meaning to suppress that alarm [18:58:14] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:58:17] Ok, thanks [18:58:28] oh, already answered :) [18:58:28] <_joe_> it's been introduced with one of the latest versions [18:58:38] Looks like what I was trying to fix just got fixed 😅 [18:58:51] I see [18:58:53] (03PS12) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [18:58:54] Thanks :) [19:00:42] (03CR) 10Volans: "Refactor and tests completed, ready to review." 
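On the `/etc/conftool/etcdrc` message seen during puppet-merge: per _joe_ above it is harmless noise introduced by a recent conftool version and due to be suppressed. The sketch below is not conftool's actual code, only an illustration of loading an optional YAML config so that a missing file is logged at debug level instead of as an error, while a present-but-broken file still is.

```python
# Not conftool's actual code: a minimal sketch of loading an optional YAML
# config (like /etc/conftool/etcdrc) without logging an error when the file
# simply doesn't exist.
import logging
import yaml

log = logging.getLogger('optional-config-loader')

def load_optional_yaml(path):
    """Return the parsed YAML mapping, or {} if the file is absent."""
    try:
        with open(path) as f:
            return yaml.safe_load(f) or {}
    except FileNotFoundError:
        # A missing optional config is expected; don't treat it as an error.
        log.debug("optional config %s not found, using defaults", path)
        return {}
    except yaml.YAMLError as e:
        # A file that exists but can't be parsed is worth a real error.
        log.error("Error parsing yaml file %s: %s", path, e)
        return {}

config = load_optional_yaml('/etc/conftool/etcdrc')
```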
[software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [19:04:01] (03CR) 10Ottomata: "Looks good: https://puppet-compiler.wmflabs.org/compiler02/12073/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [19:04:18] (03CR) 10Ottomata: [C: 032] Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [19:04:24] (03PS5) 10Ottomata: Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) [19:04:26] (03CR) 10Ottomata: [V: 032 C: 032] Enable check email reporting for camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/452451 (https://phabricator.wikimedia.org/T198908) (owner: 10Ottomata) [19:07:18] 10Operations, 10Services (watching): RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863 (10Eevans) [19:07:28] 10Operations, 10Services (watching): RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863 (10Eevans) p:05Triage>03Normal [19:08:18] (03Abandoned) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [19:09:58] (03PS1) 10Andrew Bogott: Horizon: set DEFAULT_SERVICE_REGION to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/452457 (https://phabricator.wikimedia.org/T201504) [19:10:52] (03CR) 10Andrew Bogott: [C: 032] Horizon: set DEFAULT_SERVICE_REGION to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/452457 (https://phabricator.wikimedia.org/T201504) (owner: 10Andrew Bogott) [19:18:11] (03PS1) 10Andrew Bogott: Shinken: hard-code to region 'eqiad'. [puppet] - 10https://gerrit.wikimedia.org/r/452460 [19:19:18] (03CR) 10Andrew Bogott: [C: 032] Shinken: hard-code to region 'eqiad'. [puppet] - 10https://gerrit.wikimedia.org/r/452460 (owner: 10Andrew Bogott) [19:23:53] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [19:25:25] (03PS1) 10Herron: WIP: logstash: add ids to filter configs [puppet] - 10https://gerrit.wikimedia.org/r/452461 [19:30:13] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type=stop_podsandbox https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:31:13] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:32:13] !log reindexing Malay wikis on elastic@eqiad and elastic@codfw (T200204) [19:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:20] T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204 [19:38:24] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10debt) Hi @mobrovac and @Pchelolo -- is there anything left to be done on this ticket, as it looks like the Grafana Dashboard was the last... [19:44:09] (03CR) 10Herron: [C: 04-1] "Working on something else today I was reminded that 'id' is a facter fact. 
So I think we should use a variable name other than 'id' since" [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [19:45:47] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Pchelolo) The service is not yet deployed, so let's wait a little bit. before closing it. more things might come up that would deserve to... [19:46:37] (03PS2) 10Herron: WIP: logstash: add ids to filter configs [puppet] - 10https://gerrit.wikimedia.org/r/452461 [19:49:12] (03PS1) 10Superyetkin: Set $wgCategoryCollation = uca-tr on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452464 (https://phabricator.wikimedia.org/T184943) [19:55:09] (03PS3) 10Herron: WIP: logstash: add ids to filter configs [puppet] - 10https://gerrit.wikimedia.org/r/452461 [19:58:04] (03PS1) 10MSantos: Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T2000). [20:00:45] (03CR) 10Zhuyifei1999: [C: 031] "In the future, we should probably move the errors and warnings to stderr rather than keeping it in stdout. Right now please rebase the cha" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [20:08:10] No ORES deployments today. [20:28:24] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10ayounsi) 05Open>03Resolved [20:44:49] (03PS1) 10Smalyshev: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) [20:45:34] (03CR) 10jerkins-bot: [V: 04-1] Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [20:47:33] (03PS2) 10Smalyshev: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) [21:00:05] bawolff and Reedy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T2100). 
[21:00:14] Woo [21:01:23] (03PS4) 10Filippo Giunchedi: logstash: add 'id' to inputs configuration [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) [21:05:24] (03CR) 10Filippo Giunchedi: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [21:05:25] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/12079/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [21:11:15] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) [21:12:27] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10Dzahn) @Rossi.dario.g Please create a wikitech user at https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page and let us... [21:12:37] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Dzahn) [21:12:40] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10Dzahn) 05Open>03stalled [21:14:27] Time to deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/452407/ [21:15:27] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) @Cmjohnson Could you pre-cable the hosts that will terminate on asw2-a5(ex4500) ? Not unplug anything, but have the fibers ready. [21:16:12] (03CR) 10Brian Wolff: [C: 032] Adjust CSP settings. For this stage allow inline but restrict src [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452407 (https://phabricator.wikimedia.org/T135963) (owner: 10Brian Wolff) [21:16:14] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Diego da Hora - https://phabricator.wikimedia.org/T201197 (10Dzahn) Hi @gilles, we have not heard from this user yet. We still need them to: sign L3, create a wikitech user and provide us with a SSH key to be able to move forward with... [21:16:24] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Marc Jeanmougin - https://phabricator.wikimedia.org/T201198 (10Dzahn) Hi @gilles, we have not heard from this user yet. We still need them to: sign L3, create a wikitech user and provide us with a SSH key to be able to move forward wi... [21:16:36] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10Dzahn) Hi @gilles, we have not heard from this user yet. We still need them to: sign L3, create a wikitech user and provide us with a SSH key to be able to move forward wi... [21:17:30] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10Dzahn) 05Open>03stalled [21:17:33] (03Merged) 10jenkins-bot: Adjust CSP settings. For this stage allow inline but restrict src [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452407 (https://phabricator.wikimedia.org/T135963) (owner: 10Brian Wolff) [21:18:38] (03CR) 10Thcipriani: [C: 031] "This patch is cherry-picked in beta and can merge at any time with no impact." 
[puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:20:39] (03PS2) 10Dzahn: Beta: remove deployment-{tin,mira} [puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:20:45] (03CR) 10jenkins-bot: Adjust CSP settings. For this stage allow inline but restrict src [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452407 (https://phabricator.wikimedia.org/T135963) (owner: 10Brian Wolff) [21:21:09] (03CR) 10Dzahn: [C: 032] Beta: remove deployment-{tin,mira} [puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:21:15] mutante: <3 thank you! [21:21:48] thcipriani: yw, thanks for removing those names :) [21:24:13] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:25:13] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1049 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [21:25:36] * volans checking [21:25:38] Aug 13 17:29:36 elastic1049 ferm[889]: DNS query for 'elastic1017.eqiad.wmnet' failed: NXDOMAIN [21:25:42] yeah [21:25:46] I'm running puppet [21:25:51] to see if clears it out [21:26:11] elastic1017.eqiad.wmnet has address 10.64.48.39 [21:26:46] yeah, puppet was a noop [21:27:16] I'm tempted to restart ferm but I'm worried it will make a micro-down on the network [21:28:18] volans: isn't it the other way? When i've downed ferm before it made everything open [21:28:21] it would just be starting it, but that fails [21:28:42] well, let's try starting it agian [21:29:04] mutante: ack [21:29:05] go ahead [21:29:14] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1049 is OK: OK ferm input default policy is set [21:29:21] !log elastic1049 - started ferm [21:29:23] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [21:29:23] ebernhardson: stop vs fail might be different, I'm not entirely sure [21:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:35] I expect stop to clear everything, crash to maybe leave it as is? [21:29:53] it was in status "failed" and the rules were gone [21:30:04] my ssh shell got stuck [21:30:43] my connection survived and i can ssh to it [21:30:43] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy adjust CSP settings I74a79dc7defaa (duration: 00m 52s) [21:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:59] ebernhardson: ah, it was the first start after a reboot [21:31:18] ooh.. uptime only 4 minutes.. didnt realize [21:31:29] did anyone rebooted/reimaged it today? [21:31:36] otherwise it might have just crashed [21:31:40] and you might want to depool it [21:32:40] ah ok T193649 [21:32:41] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 [21:32:50] but it was 2h ago [21:33:09] i see :) could not find it by host name [21:33:10] sorry 4h [21:33:25] that's weird, cc gehel ^^^ [21:33:46] looking... [21:34:04] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:34:07] gehel: did you rebooted elastic1049? 
[21:34:19] elastic1049 was reimaged ~4h ago, it completed without issue [21:34:26] yeah but has uptime of 6m [21:34:27] maybe that icinga check only checks every few hours? [21:34:49] oh [21:36:29] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10debt) >>! In T186748#4499730, @Pchelolo wrote: > The service is not yet deployed, so let's wait a little bit. before closing it. more thin... [21:36:40] mutante: are you also talking about elastic1049? [21:36:44] ok this is the reason why I should not look at those things after a certain time [21:37:04] gehel: yes [21:37:15] gehel: sorry uptime is 4h [21:37:15] where do you see an uptime of 6m? [21:37:18] and I was blind [21:37:22] but ferm failed after the reboot [21:37:25] Oh no, that was volans [21:37:27] Aug 13 17:29:36 elastic1049 systemd[1]: ferm.service: Main process exited, code=exited, status=255/n/a [21:37:48] volans: but back to normal after a puppet run? [21:37:53] * gehel is still reading backlog [21:37:55] no, a puppet run did nothing [21:38:01] back after starting ferm [21:38:12] strange [21:38:41] ferm said it could not lookup the DNS name of elastic1017 [21:38:55] but now it's fine [21:39:03] mutante: it alarmed just now because of dowtime of the reimage that is 4h [21:39:23] volans: oh:) that makes sense [21:39:37] i also read that uptime wrong :p [21:42:19] sorry about it, I'll blame that it's late here [21:42:22] :) [21:42:23] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [21:43:51] (03CR) 10Filippo Giunchedi: "See inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [21:47:13] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Pchelolo) > Gotcha, is there an estimated date for deployment @Pchelolo? 1 year ago? :) On a serious side - things keep popping up, so w... 
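The elastic1049 issue above amounts to ferm exiting at its first start after the reimage because the DNS lookup for elastic1017.eqiad.wmnet briefly returned NXDOMAIN, leaving the host without its input-drop policy until ferm was started again. A hypothetical pre-flight check (not an existing WMF script) along these lines would surface that before (re)starting ferm:

```python
# Hypothetical pre-flight check, not an existing WMF script: verify that the
# hostnames a ferm ruleset depends on (via @resolve) actually resolve before
# (re)starting ferm, since an NXDOMAIN makes ferm exit and load no rules.
import socket
import sys

HOSTS = [
    'elastic1017.eqiad.wmnet',   # example taken from the log above
    'elastic1049.eqiad.wmnet',
]

def unresolvable(hosts):
    """Return a list of (host, error) pairs for names that fail to resolve."""
    bad = []
    for host in hosts:
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror as err:
            bad.append((host, err))
    return bad

failures = unresolvable(HOSTS)
for host, err in failures:
    print(f"DNS query for '{host}' failed: {err}", file=sys.stderr)
sys.exit(1 if failures else 0)
```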
[21:49:07] (03CR) 10Legoktm: Removing gridengine as default and encouraging the use of Kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [21:54:10] (03CR) 10Dzahn: "this broke puppet on stat1005" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [21:54:51] (03CR) 10Filippo Giunchedi: [C: 032] logstash: add 'id' to inputs configuration [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [21:54:59] (03PS5) 10Filippo Giunchedi: logstash: add 'id' to inputs configuration [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) [21:55:09] gehel: i think https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445254/ broke puppet on stat1005 [21:55:24] (03CR) 10Legoktm: [C: 032] Add support for php7.2 image/backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452353 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [21:55:56] mutante: damn, looking (but late enough here that brain is only half working) [21:56:01] ebernhardson: ^ [21:56:09] (03Merged) 10jenkins-bot: Add support for php7.2 image/backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452353 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [21:56:19] (03Abandoned) 10Gehel: [cirrus] switch search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452438 (owner: 10Gehel) [21:57:33] gehel: i think it doesn't have to be right now, i can just ack it [21:58:17] mutante: ok, I'll add that to my list of things to look at tomorrow [21:59:25] It seems like zuul is having some delays on post-merge builds https://integration.wikimedia.org/zuul/ [21:59:37] gehel: ok, cool [21:59:44] ACKNOWLEDGEMENT - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445254/ [22:04:24] The coverage builds seem to also be having trouble [22:06:38] (03CR) 10Catrope: [C: 031] "This is now ready, and scheduled for the next SWAT in about an hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope) [22:06:41] coverage and postmerge are lowest priority [22:07:00] bawolff: https://integration.wikimedia.org/ci/view/Default/job/mediawiki-phpunit-coverage-patch-docker/ seems like it's going OK, but a bit resource starved perhaps? [22:07:12] zuul will only process them if the rest of the queue is empty [22:07:17] Right. [22:07:17] ah ok [22:07:30] gehel: looking [22:07:31] So the list going on forever is not something I should worry about [22:07:43] I say the time of 1 hour and 40 minutes and thought that looked off [22:08:08] 10Operations, 10Beta-Cluster-Infrastructure, 10Jenkins, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10Dzahn) @thcipriani resolved? [22:08:17] It'd be nice if the low-priority queues collapsed rather than took up so much space, but hey. 
:-) [22:09:02] for those two queues, being backed up is totally normal [22:09:13] they usually clear up in the afternoon/evening once people stop working :) [22:10:13] So is it totally normal the gate-submit-and-swat I'm waiting on to take 17 minutes? [22:10:16] (03PS1) 10EBernhardson: Provide logstash host/port appropriately in analytics elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/452574 [22:10:33] gehel: I think this is correct ^ mjolnir should come in through the profile now and not directly [22:10:47] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10thcipriani) [22:10:50] 10Operations, 10Beta-Cluster-Infrastructure, 10Jenkins, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10thcipriani) 05Open>03Resolved a:03thcipriani >>! In T192561#4499987, @Dzahn wrote: >... [22:11:39] (03PS1) 10Legoktm: Bump changelog [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452575 [22:12:03] bawolff: about ~20 is the new norm [22:12:10] (03CR) 10Legoktm: [C: 032] Bump changelog [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452575 (owner: 10Legoktm) [22:12:26] Ok. Just wanted to make sure there wasn't anything wrong [22:12:55] (03Merged) 10jenkins-bot: Bump changelog [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/452575 (owner: 10Legoktm) [22:20:06] ebernhardson: lgtm. i compiled that [22:20:46] (03CR) 10Dzahn: [C: 031] "this makes it compile: http://puppet-compiler.wmflabs.org/12080/stat1005.eqiad.wmnet/change.stat1005.eqiad.wmnet.pson" [puppet] - 10https://gerrit.wikimedia.org/r/452574 (owner: 10EBernhardson) [22:21:54] 10Operations, 10Services (watching): RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863 (10fgiunchedi) I've renewed the certs on `restbase-dev*` and ran puppet. Next up is cassandra roll restart to pick up the certs. [22:29:01] !log reindexing Malay wikis on elastic@eqiad and elastic@codfw complete (T200204) [22:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:08] T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204 [22:29:27] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.16/includes/ContentSecurityPolicy.php: backport I3cd2d7cc295c39 (CSP fixes) (duration: 00m 51s) [22:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:00] (03CR) 10Dzahn: [C: 031] "only affects stat1005 and puppet was already broken there for a while.. 
so going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/452574 (owner: 10EBernhardson) [22:30:02] (03CR) 10Dzahn: [C: 032] Provide logstash host/port appropriately in analytics elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/452574 (owner: 10EBernhardson) [22:30:44] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.16/includes/OutputPage.php: backport I3cd2d7cc295c39 (CSP fixes) (duration: 00m 50s) [22:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:02] !log reindexing Indonesian wikis on elastic@eqiad and elastic@codfw (T200204) [22:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:04] PROBLEM - Varnish backend child restarted on cp1089 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops [22:32:04] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.16/resources/src/startup/mediawiki.js: backport I3cd2d7cc295c39 (CSP fixes) (duration: 00m 50s) [22:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:37:42] (03PS1) 10Filippo Giunchedi: Assign graphite1004 its role [puppet] - 10https://gerrit.wikimedia.org/r/452576 (https://phabricator.wikimedia.org/T196484) [22:38:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:38:37] (03CR) 10Dzahn: [C: 032] "i ran puppet on stat1005 and it took a while but eventually finished. this fixed it. thx" [puppet] - 10https://gerrit.wikimedia.org/r/452574 (owner: 10EBernhardson) [22:38:54] (03CR) 10Filippo Giunchedi: [C: 032] Assign graphite1004 its role [puppet] - 10https://gerrit.wikimedia.org/r/452576 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi) [22:39:18] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452574/ fixed it.. merged and ran puppet on stat1005" [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [22:39:23] (03PS2) 10Filippo Giunchedi: Assign graphite1004 its role [puppet] - 10https://gerrit.wikimedia.org/r/452576 (https://phabricator.wikimedia.org/T196484) [22:40:34] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:43:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Dzahn) now on Icinga Service - Device not healthy -SMART- On Host labvirt1003 cluster=labvirt device=cciss,17 instance=labvirt1003:9100 job=node site=eqiad https://icinga.w... 
[22:44:08] ACKNOWLEDGEMENT - Device not healthy -SMART- on labvirt1003 is CRITICAL: cluster=labvirt device=cciss,17 instance=labvirt1003:9100 job=node site=eqiad daniel_zahn https://phabricator.wikimedia.org/T200203 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labvirt1003&var-datasource=eqiad%2520prometheus%252Fops
[22:59:51] (03PS1) 10Filippo Giunchedi: graphite: fix graphite-manage stretch command [puppet] - 10https://gerrit.wikimedia.org/r/452578 (https://phabricator.wikimedia.org/T196484)
[22:59:54] (03PS1) 10Filippo Giunchedi: graphite: mirror metric traffic to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/452579 (https://phabricator.wikimedia.org/T196484)
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180813T2300).
[23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:47] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:01:57] (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix graphite-manage stretch command [puppet] - 10https://gerrit.wikimedia.org/r/452578 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi)
[23:02:44] (03PS2) 10Filippo Giunchedi: graphite: fix graphite-manage stretch command [puppet] - 10https://gerrit.wikimedia.org/r/452578 (https://phabricator.wikimedia.org/T196484)
[23:03:24] I'll do my own SWAT
[23:03:56] (03PS2) 10Catrope: Enable wp10 and draftquality ORES models on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997)
[23:04:01] (03CR) 10Catrope: [C: 032] Enable wp10 and draftquality ORES models on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[23:05:34] (03Merged) 10jenkins-bot: Enable wp10 and draftquality ORES models on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[23:06:17] (03CR) 10Filippo Giunchedi: [C: 032] graphite: mirror metric traffic to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/452579 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi)
[23:06:31] (03PS2) 10Filippo Giunchedi: graphite: mirror metric traffic to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/452579 (https://phabricator.wikimedia.org/T196484)
[23:07:29] RoanKattouw: please LMK when done with swat
[23:08:07] (03CR) 10jenkins-bot: Enable wp10 and draftquality ORES models on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[23:13:01] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable wp10 and draftquality ORES models on testwiki (T198997) (duration: 00m 51s)
[23:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:08] T198997: Enable wp10 and draftquality models for testwiki - https://phabricator.wikimedia.org/T198997
[23:18:33] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) Note that "kbrown" is a username already taken in LDAP and it's **KEVIN Brown**, not Karen. @Jalexander @Kbrown Could y...
[23:18:39] godog: Done now
[23:18:58] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) a:05Dzahn>03Kbrown
[23:20:57] RoanKattouw: thanks!
[23:23:39] (03PS1) 10Dzahn: admins: add shell user for Karen Brown [puppet] - 10https://gerrit.wikimedia.org/r/452583 (https://phabricator.wikimedia.org/T201668)
[23:24:33] !log restart carbon-c-relay on graphite1001 to mirror traffic to graphite1004 - T196484
[23:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:39] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484
[23:26:29] !log Restart Cassandra in RESTBase dev env -- T201863
[23:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:35] T201863: RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863
[23:27:03] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) Hi @PEarleyWMF @Jalexander Could you please create a user on Wikitech/LDAP (https://wikitech.wikimedia.org/w/index.php?...
[23:27:22] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) a:05Dzahn>03PEarleyWMF
[23:28:08] (03CR) 10Dzahn: [C: 04-2] "need a UID / wikitech user" [puppet] - 10https://gerrit.wikimedia.org/r/452583 (https://phabricator.wikimedia.org/T201668) (owner: 10Dzahn)
[23:30:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @JanWMF Do you approve of this request?
[23:32:31] (03PS2) 10Dzahn: graphite::carbon_c_relay: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724)
[23:33:14] (03CR) 10Dzahn: "the restart of carbon_c_relay reminded me of this. should be ok, see requested compiler output" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[23:37:00] 10Operations, 10Services (watching): RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863 (10Eevans) LGTM
[23:37:50] 10Operations, 10Services (watching): RESTBase dev environment (Cassandra) SSL certificates expired - https://phabricator.wikimedia.org/T201863 (10fgiunchedi) 05Open>03Resolved Nice, thanks!
[23:51:28] (03CR) 10Fdans: [C: 031] "Looks good to me! Waiting on what ottomata has to say about your question :)" [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
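The graphite1004 work above (the mirror patch at 22:59:54/23:06:17 and the carbon-c-relay restart at 23:24:33) relies on the relay fanning incoming metrics out to both graphite1001 and graphite1004. Each datapoint in the plaintext carbon protocol is just a "metric value timestamp" line over TCP, so one way to see whether the mirror is working is to push a throwaway canary datapoint and look for it on both hosts. The Python sketch below illustrates that; the fully qualified host names, port 2003 and the metric name are assumptions for illustration, and this is not the verification procedure that was actually used.

```python
#!/usr/bin/env python3
"""Sketch: push a canary datapoint over the plaintext carbon protocol so
one can check (e.g. in the Graphite web UI) that metrics arrive on both
the primary and the newly mirrored host. The FQDNs, port 2003 and the
metric name are illustrative assumptions, not values from the log."""
import socket
import time

# Hypothetical targets; in the logged setup the relay fans out on its own,
# so hitting each host directly just isolates the two ingest paths.
TARGETS = [("graphite1001.eqiad.wmnet", 2003),
           ("graphite1004.eqiad.wmnet", 2003)]
METRIC = "test.graphite1004_mirror_canary"


def send_datapoint(host, port, metric, value):
    """Send one 'metric value timestamp' line, as carbon expects."""
    line = f"{metric} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode())


if __name__ == "__main__":
    for host, port in TARGETS:
        try:
            send_datapoint(host, port, METRIC, 1)
            print(f"sent canary to {host}:{port}")
        except OSError as exc:
            print(f"{host}:{port}: {exc}")
```

Since the fan-out happens inside carbon-c-relay in the logged setup, a single datapoint sent to the relay should already appear on both hosts; targeting each host directly, as the sketch does, is only useful for checking the two ingest paths independently.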