[00:10:29] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10RobH)
[00:11:36] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10RobH) a:05RobH>03elukey So this should likely get assigned to either @elukey or @joe, and since Luca commented, to him it goes! These can now be pressed into service, I lef...
[00:18:12] PROBLEM - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:22:21] PROBLEM - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-pdns-exporter]
[00:32:16] (03PS1) 10Catrope: Enable wp10 and draftquality ORES models on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997)
[00:32:59] (03CR) 10Catrope: [C: 04-2] "ORES service support doesn't seem to have been deployed yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[00:34:50] (03CR) 10Awight: "I'm not quite sure what this means," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[00:36:32] (03CR) 10Awight: "Aha, I see now—deployed for enwiki but testwiki has its own models. We should be able to deploy this tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451222 (https://phabricator.wikimedia.org/T198997) (owner: 10Catrope)
[00:40:38] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 50684 MB (10% inode=99%)
[00:41:08] (03CR) 10Dzahn: "setup a test instance here: http://design-test.wmflabs.org/style-guide/" [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) (owner: 10Dzahn)
[00:43:59] (03CR) 10Dzahn: "http://design-test.wmflabs.org/style-guide/RANDOMSTUFF gets redirected as requested" [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) (owner: 10Dzahn)
[00:56:38] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[00:57:40] (03CR) 10Dzahn: [C: 04-1] "nope.. redirect loop" [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) (owner: 10Dzahn)
[01:48:07] 10Operations, 10Wikidata: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC - https://phabricator.wikimedia.org/T198049 (10tstarling) > db1071, the master, had no writes It actually had a factor of 10 fewer writes, not zero writes. I'm pretty sure there was no outage. I had a...
[02:05:55] (03PS2) 10Dzahn: design.wm.org: add apache redirect for style-guide/wiki/ [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304)
[02:07:47] (03PS3) 10Dzahn: design.wm.org: add apache redirect for style-guide/wiki/ [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304)
[02:09:27] (03PS4) 10Dzahn: design.wm.org: add apache redirect for style-guide/wiki/ [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304)
[02:11:49] (03CR) 10Dzahn: [C: 032] "tested on design-test.wmflabs.org/style-guide/wiki/Foo and on #wikimedia-design with prtksxna" [puppet] - 10https://gerrit.wikimedia.org/r/451204 (https://phabricator.wikimedia.org/T200304) (owner: 10Dzahn)
[02:15:26] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (10Prtksxna)
[02:15:30] 10Operations, 10Domains, 10Traffic, 10WikimediaUI Style Guide, 10Patch-For-Review: Redirect design.wikimedia.org/style-guide/wiki/* to design.wikimedia.org/style-guide/ - https://phabricator.wikimedia.org/T200304 (10Prtksxna) 05Open>03Resolved Thanks a ton @Dzahn! Works now {icon smile-o}
[02:28:35] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.15) (duration: 08m 40s)
[02:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:17] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 16m 06s)
[03:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:44] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Aug 8 03:12:43 UTC 2018 (duration 10m 26s)
[03:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:18:30] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Some doc provided by JTAC: https://www.juniper.net/documentation/en_US/release-independent/vcf/information-products/pathway-pages/vcf-best-practices-guide.pdf https://www.juniper...
[03:21:58] (03CR) 10Krinkle: [C: 031] Scap: update logstash_checker.py mwdeploy query [puppet] - 10https://gerrit.wikimedia.org/r/449639 (owner: 10Thcipriani)
[03:26:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 884.48 seconds
[03:46:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.65 seconds
[05:07:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451229
[05:09:17] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451229 (owner: 10Marostegui)
[05:10:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451229 (owner: 10Marostegui)
[05:12:03] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2075 (duration: 01m 06s)
[05:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:41] (03PS1) 10Marostegui: db-codfw.php: Depool pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451231 (https://phabricator.wikimedia.org/T201387)
[05:20:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451231 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui)
[05:21:53] (03Merged) 10jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451231 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui)
[05:21:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451229 (owner: 10Marostegui)
[05:22:06] (03CR) 10jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451231 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui)
[05:23:31] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2004 - T201387 (duration: 00m 56s)
[05:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:36] T201387: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387
[05:37:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0
[05:38:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0
[06:13:44] !log kartik@deploy1001 Started deploy [cxserver/deploy@6a0cab1]: Update cxserver to 951fdba (T199308, T199512, T199320, T200665, T200453, T106437)
[06:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:59] T199320: CX2: References content is duplicated multiple times when adapted - https://phabricator.wikimedia.org/T199320
[06:14:00] T200665: Cite tags are repeated in the MT output - https://phabricator.wikimedia.org/T200665
[06:14:02] T199308: CX2: Avoid using inexistent parameters when mapping template parameters - https://phabricator.wikimedia.org/T199308
[06:14:02] T200453: CX2: Cannot edit videos - https://phabricator.wikimedia.org/T200453
[06:14:02] T106437: Red and grey links keep their styling even when turned into a regular link (with link inspector) - https://phabricator.wikimedia.org/T106437
[06:14:03] T199512: CX2: Improve support for different types of references - https://phabricator.wikimedia.org/T199512
[06:17:16] !log kartik@deploy1001 Finished deploy [cxserver/deploy@6a0cab1]: Update cxserver to 951fdba (T199308, T199512, T199320, T200665, T200453, T106437) (duration: 03m 32s)
[06:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:36] (03CR) 10PleaseStand: adding torrelay1001 ipv6 entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/451033 (https://phabricator.wikimedia.org/T196701) (owner: 10RobH)
[06:30:19] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:45:22] (03CR) 10Mobrovac: [C: 031] EventStreams now supports multi DC, but should run active/passive [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata)
[06:55:33] <_joe_> I have to talk to otto about this
[06:55:51] <_joe_> active/passive isn't really acceptable for a service created in the last few years
[06:57:50] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "It's completely unacceptable for a relatively new service to not be active-active in reading. We are even converting *Mediawiki* to be act" [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata)
[07:00:30] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:40] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational
[07:03:17] hallo moritzm
[07:03:26] can you please take a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451105/ ?
[07:03:30] <_joe_> aharoni: I think moritzm is not around
[07:04:00] <_joe_> you can ask the person who's on clinic duty though for such a patch
[07:04:03] <_joe_> :)
[07:04:33] <_joe_> but for that you might need to wait for ~ 1 hour
[07:04:56] oh. And who's on clinic duty?
[07:05:13] Oh, the clinic will be in 1 hour.
[07:05:27] aharoni: see topic
[07:05:29] <_joe_> well, godog usually comes online around that time :)
[07:06:15] hi
[07:06:51] aharoni: sure I'll take a look
[07:07:24] thanks jynus , thanks godog
[07:08:18] aharoni: what happened btw?
[07:09:11] had to reinstall the laptop. I backed up a lot of things, but somehow forgot to back up the keys.
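The design.wikimedia.org style-guide/wiki/ redirect merged earlier in this log (gerrit change 451204, T200304) went through several patch sets after an initial "redirect loop" -1. As a rough illustration only (the actual rule in the puppet patch is not shown here), an Apache rule for such a redirect might look like the following; anchoring the pattern on the `/wiki/` prefix is what keeps the rule from matching its own target and looping:

```apache
# Hypothetical sketch of the style-guide/wiki/ redirect (T200304).
# The real merged patch may use different directives or a vhost template;
# this is an assumption, not the deployed configuration.
<VirtualHost *:80>
    ServerName design.wikimedia.org
    # Send /style-guide/wiki/<anything> back to the style guide landing page.
    # The target /style-guide/ does not match the pattern, so no loop.
    RedirectMatch 301 ^/style-guide/wiki/.*$ /style-guide/
</VirtualHost>
```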
[07:09:40] RECOVERY - Check systemd state on cp5012 is OK: OK - running: The system is fully operational
[07:10:10] ack
[07:10:27] (03PS2) 10Filippo Giunchedi: Replace ssh keys for amire80 [puppet] - 10https://gerrit.wikimedia.org/r/451105 (https://phabricator.wikimedia.org/T201454) (owner: 10Amire80)
[07:13:23] (03CR) 10Filippo Giunchedi: [C: 032] Replace ssh keys for amire80 [puppet] - 10https://gerrit.wikimedia.org/r/451105 (https://phabricator.wikimedia.org/T201454) (owner: 10Amire80)
[07:14:33] aharoni: {{done}} should be live on the fleet shortly
[07:14:48] godog: thanks
[07:16:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: update ssh keys for amire80 - August 2018 - https://phabricator.wikimedia.org/T201454 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Public key swapped
[07:18:30] (03PS1) 10Jcrespo: mariadb: Depool db1100 and db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451235 (https://phabricator.wikimedia.org/T201392)
[07:18:56] (03PS7) 10Filippo Giunchedi: logstash: add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362)
[07:22:49] (03PS8) 10Filippo Giunchedi: logstash: add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362)
[07:24:01] (03CR) 10Filippo Giunchedi: logstash: add jmx_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi)
[07:24:20] (03PS9) 10Filippo Giunchedi: logstash: add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362)
[07:25:01] (03CR) 10Filippo Giunchedi: [C: 032] logstash: add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/451018 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi)
[07:27:14] !log Stop MySQL on pc2004 - T201387
[07:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
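The logstash jmx_exporter work above (gerrit changes 451018 and 451238, T200362) has two halves: a JVM-side exporter exposing JMX metrics over HTTP, and a Prometheus scrape job collecting them. A minimal sketch of such a scrape job follows; the job name, hosts, port, and interval are illustrative assumptions, not the values from the actual puppet patch:

```yaml
# Hypothetical Prometheus scrape job for a logstash jmx_exporter (T200362).
# Target host and port 7800 are placeholders for illustration.
scrape_configs:
  - job_name: 'jmx_logstash'
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'logstash1007.eqiad.wmnet:7800'
```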
[07:27:19] T201387: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387
[07:30:34] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Marostegui) @Papaul as per our chat yesterday, this host is now depooled, silenced and MySQL is stopped. You are good to go to update the BIOS whenever you arrive to the DC!...
[07:33:15] (03PS1) 10Filippo Giunchedi: prometheus: add logstash jmx_exporter job [puppet] - 10https://gerrit.wikimedia.org/r/451238 (https://phabricator.wikimedia.org/T200362)
[07:34:06] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add logstash jmx_exporter job [puppet] - 10https://gerrit.wikimedia.org/r/451238 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi)
[07:35:03] (03PS1) 10Mohab Fekry: Currently it is not possible to silence the Kafka broker's idle connection reaper on the client side from within varnishkafka, since the librdkafka client configuration property "log.connection.close" is not supported within the varnishkafka allowed subset of config properties. This patch introduces this configuration property as part of the varnishkafka.conf file and sets it to the internal librdkafk
[07:35:03] (varnishv51) - 10https://gerrit.wikimedia.org/r/451239
[07:35:05] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451239 (owner: 10Mohab Fekry)
[07:36:47] (03Abandoned) 10Mohab Fekry: Currently it is not possible to silence the Kafka broker's idle connection reaper on the client side from within varnishkafka, since the librdkafka client configuration property "log.connection.close" is not supported within the varnishkafka allowed subset of config properties. This patch introduces this configuration property as part of the varnishkafka.conf file and sets it to the internal lib
[07:36:47] (varnishv51) - 10https://gerrit.wikimedia.org/r/451239 (owner: 10Mohab Fekry)
[07:38:56] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1100 and db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451235 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo)
[07:40:13] (03Merged) 10jenkins-bot: mariadb: Depool db1100 and db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451235 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo)
[07:40:55] (03CR) 10jenkins-bot: mariadb: Depool db1100 and db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451235 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo)
[07:42:07] (03CR) 10Aklapper: "Hi, thanks for your patch!" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451239 (owner: 10Mohab Fekry)
[07:44:25] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10fgiunchedi) Since this request is expanding root scope to other boxes I believe it'll need to be put up at the next SRE meeting on Monday
[07:44:34] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10fgiunchedi) p:05Triage>03Normal
[07:47:45] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 and db1123 (duration: 00m 58s)
[07:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:28] (03PS1) 10Mobrovac: Varnish: Unset X-Request-Id for external requests [puppet] - 10https://gerrit.wikimedia.org/r/451240 (https://phabricator.wikimedia.org/T201409)
[07:51:26] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10fgiunchedi) Hi @Rossi.dario.g, this request in particular is to create your user on the WMF cluster. Wikitech users creation is self-service, please create an...
[07:51:59] (03PS1) 10Ema: cp2008: enable numa_networking [puppet] - 10https://gerrit.wikimedia.org/r/451241 (https://phabricator.wikimedia.org/T193865)
[07:54:20] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10fgiunchedi) To clarify, the procedure to request a developer account is here: https://www.mediawiki.org/wiki/Developer_account
[07:55:07] !log stop db1100 for provisioning and upgrade
[07:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:53] (03CR) 10Ema: [C: 032] cp2008: enable numa_networking [puppet] - 10https://gerrit.wikimedia.org/r/451241 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[07:56:12] thanks godog, everything works
[07:56:31] sweet
[07:58:15] (03PS2) 10Mobrovac: Varnish: Unset X-Request-Id for external requests [puppet] - 10https://gerrit.wikimedia.org/r/451240 (https://phabricator.wikimedia.org/T201409)
[08:03:58] !log stop db1123 for provisioning and upgrade
[08:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:12] (03PS1) 10Ema: cp4023: enable numa_networking [puppet] - 10https://gerrit.wikimedia.org/r/451244 (https://phabricator.wikimedia.org/T193865)
[08:32:27] (03CR) 10Ema: [C: 032] cp4023: enable numa_networking [puppet] - 10https://gerrit.wikimedia.org/r/451244 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[08:41:13] !log Drop unused grants from 208.80.154.136
[08:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:18] !log Drop unused grants from 208.80.154.12
[08:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:00] PROBLEM - Device not healthy -SMART- on db1068 is CRITICAL: cluster=mysql device=megaraid,9 instance=db1068:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1068&var-datasource=eqiad%2520prometheus%252Fops
[08:58:18] !log Drop unused grants from 208.80.153.48
[08:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:40] (03PS1) 10Ema: Enable numa_networking on all caches [puppet] - 10https://gerrit.wikimedia.org/r/451248 (https://phabricator.wikimedia.org/T193865)
[09:03:09] (03CR) 10Ema: [C: 032] Enable numa_networking on all caches [puppet] - 10https://gerrit.wikimedia.org/r/451248 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[09:08:57] !log begin rolling reboots of caches for numa_networking T193865
[09:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:02] T193865: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865
[09:15:29] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10fgiunchedi) a:05RStallman-legalteam>03Lea_WMDE @Lea_WMDE what's your LDAP username to be added to wmde group?
[09:16:59] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) p:05Triage>03Normal I am not seeing packet loss anymore after moving to persisted queues, I'm resolving this though feel free to reopen. There is...
[09:17:14] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi
[09:22:28] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10fgiunchedi) [looking at the bikeshed] to me `monitor` is a bit too generic, also this is likely to be a single-use box (i.e. only icinga) so `icinga1001` would work better IMO
[09:24:26] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui)
[09:24:58] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui) p:05Triage>03Normal
[09:25:03] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10jcrespo) alert1001 too for a compromise of not using a vendor but also not getting confused with debmonitor or dbmonitor or prometheus. icinga1001 is ok too, I don't think that...
[09:25:19] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1068 is CRITICAL: cluster=mysql device=megaraid,9 instance=db1068:9100 job=node site=eqiad Marostegui T201493 - The acknowledgement expires at: 2018-08-10 09:25:04. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1068&var-datasource=eqiad%2520prometheus%252Fops
[09:26:37] (03PS1) 10Volans: Tests: refactor get_fixture_path() [software/spicerack] - 10https://gerrit.wikimedia.org/r/451253 (https://phabricator.wikimedia.org/T199079)
[09:26:39] (03PS1) 10Volans: Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254
[09:27:19] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255
[09:27:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: add ServerAlias support [puppet] - 10https://gerrit.wikimedia.org/r/451256
[09:27:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257
[09:27:25] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968)
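The varnishkafka patch at 07:35 above (gerrit 451239, later abandoned and resubmitted) explains the mechanism: librdkafka's `log.connection.close` property controls whether broker idle-connection closes are logged, but varnishkafka only passes through a whitelisted subset of properties, so the patch adds it to that subset. Assuming varnishkafka's usual convention of forwarding `kafka.`-prefixed keys to librdkafka, the resulting varnishkafka.conf line might look like this; the exact key spelling is an assumption, not taken from the patch itself:

```ini
# Hypothetical varnishkafka.conf fragment: silence the Kafka broker's
# idle connection reaper log spam on the client side (gerrit 451239).
# "kafka."-prefixed properties are handed to librdkafka once whitelisted.
kafka.log.connection.close = false
```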
[09:27:27] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968)
[09:27:30] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968)
[09:28:19] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::vhost: add ServerAlias support [puppet] - 10https://gerrit.wikimedia.org/r/451256 (owner: 10Giuseppe Lavagetto)
[09:35:08] (03PS1) 10Giuseppe Lavagetto: Fix logging.basicConfig call in tests [software/conftool] - 10https://gerrit.wikimedia.org/r/451265
[09:35:10] (03PS1) 10Giuseppe Lavagetto: Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266
[09:35:12] (03PS1) 10Giuseppe Lavagetto: Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267
[09:35:26] <_joe_> volans: ^^
[09:35:32] thanks!!!
[09:35:45] (03CR) 10Volans: [C: 031] "LGTM!" [software/conftool] - 10https://gerrit.wikimedia.org/r/451265 (owner: 10Giuseppe Lavagetto)
[09:36:44] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto)
[09:37:20] (03CR) 10Volans: [C: 031] "LGTM, don't forget PyPI too :)" [software/conftool] - 10https://gerrit.wikimedia.org/r/451267 (owner: 10Giuseppe Lavagetto)
[09:37:39] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268
[09:40:19] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix logging.basicConfig call in tests [software/conftool] - 10https://gerrit.wikimedia.org/r/451265 (owner: 10Giuseppe Lavagetto)
[09:41:12] (03PS1) 10Jcrespo: mariadb: Repool db1100, db1123 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451269 (https://phabricator.wikimedia.org/T201392)
[09:41:39] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Volans) On second thought I agree that `monitor1001` is a bit confusing between monitoring and alarming and metrics and all of that. Ack for either `icinga1001` or, as a fallba...
[09:42:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:42:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto)
[09:43:11] (03CR) 10Giuseppe Lavagetto: [C: 032] Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267 (owner: 10Giuseppe Lavagetto)
[09:46:10] <_joe_> uhm zuul doesn't seem to be too healthy
[09:46:18] (03Merged) 10jenkins-bot: Fix logging.basicConfig call in tests [software/conftool] - 10https://gerrit.wikimedia.org/r/451265 (owner: 10Giuseppe Lavagetto)
[09:46:21] (03CR) 10Zhuyifei1999: [C: 032] "Ok, will do that changelog in s separate patch." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: 10BryanDavis)
[09:47:06] (03CR) 10jerkins-bot: [V: 04-1] Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto)
[09:47:08] (03CR) 10jerkins-bot: [V: 04-1] Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267 (owner: 10Giuseppe Lavagetto)
[09:47:13] (03Merged) 10jenkins-bot: Kubernetes: ignore terminating objects when searching [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: 10BryanDavis)
[09:47:55] _joe_: did you merged before CI completed?
[09:48:08] <_joe_> no I did not...
[09:48:16] <_joe_> I just gave +2
[09:48:55] unit are failing
[09:49:17] <_joe_> argh yes
[09:49:22] <_joe_> last-second refactor
[09:49:35] <_joe_> easy to fix!
[09:50:07] yeah seems so
[09:50:28] (03PS2) 10Giuseppe Lavagetto: Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266
[09:50:30] (03PS2) 10Giuseppe Lavagetto: Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267
[09:50:56] <_joe_> now we can wait patiently for jenkins to confirm
[09:51:05] * volans sit back and relax
[09:51:18] <_joe_> I have to work on other things now, sorry
[09:52:31] (03PS10) 10Giuseppe Lavagetto: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - 10https://gerrit.wikimedia.org/r/443762 (owner: 10Krinkle)
[09:53:15] (03PS1) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271
[09:53:28] <_joe_> sigh jenkins is sloooow this morning
[09:54:05] yesterday was pretty slow as well
[09:54:20] I think volans it's coding too fast
[09:54:25] we should rate-limit him
[09:54:27] <_joe_> now there is a large queue on gearman
[09:54:34] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - 10https://gerrit.wikimedia.org/r/443762 (owner: 10Krinkle)
[09:55:19] (03PS1) 10Zhuyifei1999: Bump debian package version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/451272
[09:55:57] (03CR) 10Zhuyifei1999: [C: 032] Bump debian package version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/451272 (owner: 10Zhuyifei1999)
[09:56:00] RECOVERY - Device not healthy -SMART- on db1068 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1068&var-datasource=eqiad%2520prometheus%252Fops
[09:56:06] (03CR) 10Volans: "An alternative approach could be to move the cron to systemd timers, that on failure will be catch by the existing checks on systemd units" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/451181 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm)
[09:56:11] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[09:56:51] (03Merged) 10jenkins-bot: Bump debian package version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/451272 (owner: 10Zhuyifei1999)
[09:58:27] (03CR) 10jerkins-bot: [V: 04-1] Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto)
[10:01:50] (03PS1) 10Marostegui: production.sql.erb: Change repl user for consistency [puppet] - 10https://gerrit.wikimedia.org/r/451273 (https://phabricator.wikimedia.org/T146149)
[10:02:44] (03CR) 10jerkins-bot: [V: 04-1] Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto)
[10:02:51] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez)
[10:02:56] (03CR) 10jerkins-bot: [V: 04-1] Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267 (owner: 10Giuseppe Lavagetto)
[10:03:23] (03CR) 10Jcrespo: [C: 031] production.sql.erb: Change repl user for consistency [puppet] - 10https://gerrit.wikimedia.org/r/451273 (https://phabricator.wikimedia.org/T146149) (owner: 10Marostegui)
[10:03:44] (03CR) 10Marostegui: [C: 032] production.sql.erb: Change repl user for consistency [puppet] - 10https://gerrit.wikimedia.org/r/451273 (https://phabricator.wikimedia.org/T146149) (owner: 10Marostegui)
[10:04:19] 10Operations: Fix permissions of /srv/mediawiki-staging/private/README_BEFORE_MODIFYING_ANYTHING on mwdeploy1001 - https://phabricator.wikimedia.org/T201494 (10Addshore)
[10:04:36] ^^ an easy one for any opsen
[10:06:44] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12016/webperf1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle)
[10:07:05] (03PS11) 10Giuseppe Lavagetto: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle)
[10:07:59] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:07:59] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:00] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6
[10:08:09] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:10] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:10] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:10] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6
[10:08:10] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:10] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6
[10:08:19] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6
[10:08:19] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6
[10:08:19] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:20] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:21] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:21] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:22] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:29] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6 [10:08:30] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:30] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:43] <_joe_> 
I guess this is unexpected, right? [10:08:49] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:49] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:50] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:50] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:50] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:50] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:08:57] sorry, that's me ^ [10:08:59] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:09:00] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [10:10:00] <_joe_> ema: ack :) [10:11:46] !log power-cycle cp2022, stuck rebooting [10:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] 10Operations: Fix permissions of /srv/mediawiki-staging/private/README_BEFORE_MODIFYING_ANYTHING on mwdeploy1001 - https://phabricator.wikimedia.org/T201494 (10Tgr) 05Open>03Resolved a:03Tgr [10:14:56] (03PS2) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [10:15:17] uh, cp2022 booted into d-i [10:16:06] ema: was it reimaged recently? 
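Every IPsec alert above names the same two unreachable peers, cp2022_v4 and cp2022_v6, so the whole storm has a single root cause: cp2022 itself. A small illustrative sketch (hypothetical helper, not an actual WMF tool) of collapsing Strongswan check output down to the distinct failing hosts:

```python
import re

def not_connected_peers(check_output):
    """Extract peer names from a line like
    'Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6'."""
    m = re.search(r"not-conn:\s*(.+)$", check_output)
    return [p.strip() for p in m.group(1).split(",")] if m else []

alerts = [
    "Strongswan CRITICAL - ok: 66 not-conn: cp2022_v4, cp2022_v6",
    "Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6",
]
# Strip the _v4/_v6 suffix to find the hosts behind the noise.
hosts = {p.rsplit("_", 1)[0] for a in alerts for p in not_connected_peers(a)}
print(hosts)  # -> {'cp2022'}
```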
[10:16:26] volans: I don't think so [10:17:00] PROBLEM - Freshness of zerofetch successful run file on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:00] PROBLEM - dhclient process on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:00] PROBLEM - MD RAID on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:09] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3125: Connection refused [10:17:10] ema: because I audited the whole fleet in april, see T193155 [10:17:11] T193155: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 [10:17:19] PROBLEM - HTTPS Unified RSA on cp2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:17:19] and fixed all the ones that had it on PXE [10:17:20] PROBLEM - Varnish traffic logger - varnishmedia on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:20] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3123: Connection refused [10:17:20] PROBLEM - traffic-pool service on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:20] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:24] ack'ing all cp2022-related alerts, sorry for the spam! [10:17:30] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:30] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3127: Connection refused [10:17:34] <_joe_> ema: did you manage to stop it before the reimage started?
[10:17:35] the reimage script gives you a WARNING if PXE is still set after the reimage [10:17:38] <_joe_> d-i I mean [10:17:39] PROBLEM - confd service on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:39] PROBLEM - Disk space on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:39] PROBLEM - Check systemd state on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:49] PROBLEM - Varnish HTCP daemon on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:49] PROBLEM - Webrequests Varnishkafka log producer on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:50] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:50] PROBLEM - Varnish traffic logger - varnishreqstats on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:51] PROBLEM - configured eth on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:51] PROBLEM - DPKG on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:51] PROBLEM - Confd vcl based reload on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:17:59] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3124: Connection refused [10:18:00] PROBLEM - HTTPS Unified ECDSA on cp2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:18:00] PROBLEM - Varnish traffic logger - varnishstatsd on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:18:10] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3128: Connection refused [10:18:30] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3120: Connection refused [10:19:01] _joe_: I didn't, no [10:19:10] PROBLEM - puppet last run on cp2022 is CRITICAL: Return code of 255 is out of bounds [10:19:39] go with a
reimage then :) [10:19:48] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Rename role::xenon to profile::webperf::xenon [puppet] - 10https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:20:10] (03PS10) 10Giuseppe Lavagetto: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - 10https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:20:16] volans: yeah I guess we have a good candidate for the next host to be reimaged as stretch! [10:20:23] ahahah [10:20:38] <_joe_> ema: did you depool it? [10:21:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:21:15] _joe_: yeah it was depooled for reboot [10:21:25] (and still is) [10:24:02] cp2022 powered down, will proceed with reimage after lunch [10:24:13] ack [10:24:25] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1100, db1123 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451269 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:24:29] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:18] (03PS8) 10Giuseppe Lavagetto: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:25:29] ACKNOWLEDGEMENT - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% Ema Host powered down. To be reimaged to stretch. 
[10:25:46] (03Merged) 10jenkins-bot: mariadb: Repool db1100, db1123 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451269 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:26:02] (03PS3) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [10:27:38] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 and db1123 with low load after maint (duration: 00m 57s) [10:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:01] (03CR) 10jenkins-bot: mariadb: Repool db1100, db1123 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451269 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:33:16] there was a spike of api issues, but not related to the deployment [10:33:23] it was on enwiki [10:33:57] ApiQueryRevisions [10:36:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Small fix but lgtm overall. I still didn't check the code for more references to the classes renamed here." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:45:15] (03CR) 10Giuseppe Lavagetto: [C: 031] webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:45:50] (03CR) 10Giuseppe Lavagetto: [C: 031] webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:53:10] (03PS1) 10Jcrespo: mariadb-package: Package MariaDB 10.1.35 for stretch [software] - 10https://gerrit.wikimedia.org/r/451280 [10:55:23] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 [10:56:21] (03PS1) 10Mohab Fekry: Extend varnishkafka config properties with the log.connection.close librdkafka config property [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 [10:57:28] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 [10:57:56] (03PS1) 10ArielGlenn: don't call getConfiguration with --groups option, it's not supported [dumps] - 10https://gerrit.wikimedia.org/r/451292 [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:28] nice, no patches, no swat [11:01:22] (03CR) 10ArielGlenn: [C: 032] don't call getConfiguration with --groups option, it's not supported [dumps] - 10https://gerrit.wikimedia.org/r/451292 (owner: 10ArielGlenn) [11:02:02] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) 05Resolved>03Open Reopening since I just saw a brief 40 packets/s loss on logstash1008 [11:02:27] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) [11:02:53] !log ariel@deploy1001 Started deploy [dumps/dumps@d6bd774]: fix up getConfiguration invocation [11:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] !log ariel@deploy1001 Finished deploy [dumps/dumps@d6bd774]: fix up getConfiguration invocation (duration: 00m 04s) [11:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:47] hi ops, we've received an alert for: cp2022/Webrequests Varnishkafka log producer is CRITICAL [11:05:07] right now, none of the team-mates present has permissions to troubleshoot this, can you help please?
[11:05:45] ^ema [11:06:59] mforns: he is not around, but you should not worry about this [11:07:39] that host is not pooled and I think by mistake some false positive alerts were sent [11:08:03] jynus, I've seen: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=cp2022&var-network=eth0&from=now-3h&to=now [11:08:05] Ahhh - Thanks for this info jynus :) --^ [11:08:22] note that this is my current understanding, but I may be wrong [11:08:42] jynus, thanks a lot [11:08:45] what I know is that cp2022 is down on purpose [11:08:51] (03PS1) 10Marostegui: db-codfw.php: Depool pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) [11:08:58] ok cool [11:09:34] but there should be 16 other hosts taking over its role [11:09:38] (03PS2) 10Arturo Borrero Gonzalez: toolforge: Document inclusion of texlive-full package [puppet] - 10https://gerrit.wikimedia.org/r/450610 (https://phabricator.wikimedia.org/T197176) (owner: 10BryanDavis) [11:10:29] (03CR) 10Jcrespo: "could you amend the title, as you are also depooling pc2006?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [11:10:41] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: Document inclusion of texlive-full package [puppet] - 10https://gerrit.wikimedia.org/r/450610 (https://phabricator.wikimedia.org/T197176) (owner: 10BryanDavis) [11:11:51] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [11:11:58] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Depool pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [11:12:17] 10Operations, 10cloud-services-team, 10decommission, 10hardware-requests, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10aborrero) a:05aborrero>03RobH [11:13:23] (03Merged) 10jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [11:14:30] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2005 - T201387 (duration: 00m 59s) [11:14:33] !log Stop MySQL on pc2005 for BIOS upgrade - T201387 [11:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] T201387: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 [11:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Marostegui) @Papaul pc2005 is also depooled and with MySQL down. You can upgrade pc2004 and pc2005 at the same time Thanks! 
[11:19:16] (03PS4) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [11:20:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [11:21:15] !log switching over, stopping, upgrading and restarting labsdb1009 [11:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:40] (03CR) 10jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451293 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [11:45:21] (03PS4) 10Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) [11:45:23] (03PS1) 10Jcrespo: wikireplicas: Monitor wikireplicas are always in read only [puppet] - 10https://gerrit.wikimedia.org/r/451304 (https://phabricator.wikimedia.org/T172489) [11:48:27] (03CR) 10Marostegui: [C: 031] wikireplicas: Monitor wikireplicas are always in read only [puppet] - 10https://gerrit.wikimedia.org/r/451304 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:49:07] (03CR) 10Jcrespo: [C: 032] wikireplicas: Monitor wikireplicas are always in read only [puppet] - 10https://gerrit.wikimedia.org/r/451304 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:49:18] (03PS2) 10Jcrespo: wikireplicas: Monitor wikireplicas are always in read only [puppet] - 10https://gerrit.wikimedia.org/r/451304 (https://phabricator.wikimedia.org/T172489) [11:53:33] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [12:00:04] Deploy window Pre MediaWiki train sanity break 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T1200) [12:00:06] !log testing new read only check on labsdb wikireplicas [12:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:11] mforns: yes, what jynus said :) [12:03:22] ema, thanks! :] [12:06:47] (03PS1) 10Ema: cp2022: upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/451305 (https://phabricator.wikimedia.org/T200445) [12:08:55] (03CR) 10Ema: [C: 032] cp2022: upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/451305 (https://phabricator.wikimedia.org/T200445) (owner: 10Ema) [12:15:49] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp2022.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [12:16:57] (03PS1) 10Jcrespo: mariadb: Make wikireplicas obey the current mariadb read_only config [puppet] - 10https://gerrit.wikimedia.org/r/451307 (https://phabricator.wikimedia.org/T172489) [12:18:32] (03CR) 10Marostegui: [C: 031] "To me +1 as the only thing I can think of that might need it to 0, maintainviews...it has SUPER privilege." [puppet] - 10https://gerrit.wikimedia.org/r/451307 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [12:20:07] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [12:31:25] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2022.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2022.codfw.wmnet'] ``` [12:39:44] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10MarcoAurelio) Hi. Any status updates here? Thanks.
[12:44:08] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 [12:46:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [12:47:28] (03PS7) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [12:50:50] (03PS5) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [12:51:28] PROBLEM - MariaDB read only wikireplica on labsdb1011 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 5461557s, 1724.32 QPS, connection latency: 0.003490s, query latency: 0.000601s [12:51:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:52:46] see log for the reason of those criticals- they are an alerting test and at the same time a real issue [12:52:57] I am working on it right now [12:53:55] !log switching over, stopping, upgrading and restarting labsdb1010 [12:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T1300) [13:01:44] (03PS6) 10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [13:02:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [13:05:08] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [13:06:12] (03PS7) 
10Vgutierrez: [WIP] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [13:07:57] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: cleanup openstack liberty files [puppet] - 10https://gerrit.wikimedia.org/r/451315 [13:10:49] !log BIOS update in progress on pc2004 [13:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:35] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [13:16:44] (03PS3) 10Giuseppe Lavagetto: Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 [13:18:35] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix exception raised when the wrong tags are provided [software/conftool] - 10https://gerrit.wikimedia.org/r/451266 (owner: 10Giuseppe Lavagetto) [13:18:37] (03PS3) 10Giuseppe Lavagetto: Bump version to 1.0.2-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/451267 [13:22:18] RECOVERY - MariaDB read only wikireplica on labsdb1011 is OK: Version 10.1.33-MariaDB, Uptime 5463404s, read_only: True, 1097.08 QPS, connection latency: 0.002121s, query latency: 0.000855s [13:26:47] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Marostegui) [13:29:15] (03PS8) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [13:31:02] (03PS9) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [13:33:45] (03CR) 10Ema: trafficserver: initial module/profile/role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [13:34:59] !log BIOS update in progress on pc2005 
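The new read-only wikireplica check exercised above (CRITICAL at 12:51, RECOVERY at 13:22) reduces to comparing the server's read_only flag against the expected value and emitting a Nagios-style status. A minimal sketch of that comparison logic, illustrative only and not the actual check script:

```python
def check_read_only(actual, expected=True):
    """Return an (exit_code, message) pair in the Nagios convention:
    0 = OK, 2 = CRITICAL."""
    if actual == expected:
        return 0, f"OK: read_only: {actual}"
    return 2, f"CRIT: read_only: {actual}, expected {expected}"

# A wikireplica accidentally left writable, as in the 12:51 alert:
print(check_read_only(False))  # -> (2, 'CRIT: read_only: False, expected True')
```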
[13:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] (03PS3) 10Ottomata: Fix comment about EventStreams active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) [13:39:49] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) [13:40:47] (03PS1) 10Volans: Doc: fix library examples [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 [13:42:10] (03CR) 10Ottomata: "librdkafka settings should be available by prefixing them with 'kafka.'. Are you sure that" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 (owner: 10Mohab Fekry) [13:44:51] (03CR) 10Mohab Fekry: "> Patch Set 1:" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 (owner: 10Mohab Fekry) [13:45:52] (03CR) 10Giuseppe Lavagetto: [C: 031] Fix comment about EventStreams active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata) [13:49:21] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Papaul) [13:49:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Papaul) 05Open>03Resolved @Marostegui complete closing the task. Thanks [13:50:34] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 (10Marostegui) Thanks - I will take it from here, to repool the servers once they've caught up! 
[13:51:21] (03PS1) 10BBlack: cpNNNN: remove all jessie installer settings [puppet] - 10https://gerrit.wikimedia.org/r/451324 (https://phabricator.wikimedia.org/T200445) [13:51:23] (03PS1) 10BBlack: Spare out the unused eqiad caches for future decom [puppet] - 10https://gerrit.wikimedia.org/r/451325 [13:51:25] (03PS1) 10BBlack: [WIP] Move cache::canary from cp1008 to cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/451326 [13:53:22] <_joe_> bblack: are we making a small ceremony for cp1008 aka the pink unicorn? [13:54:46] (03CR) 10Ema: [C: 031] cpNNNN: remove all jessie installer settings [puppet] - 10https://gerrit.wikimedia.org/r/451324 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [13:55:22] _joe_: we'll still have pinkunicorn the hostname, it will just move around and go through LVS with some new IP :) [13:55:40] <_joe_> it won't be the *same* pink unicorn though [13:55:55] [I don't think there's really much value in having the pinkunicorn be outside LVS at this point, vs the risks] [13:56:29] (03CR) 10Mohab Fekry: "> Patch Set 1:" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 (owner: 10Mohab Fekry) [13:57:50] (03PS4) 10Ottomata: Fix comment about EventStreams active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) [13:57:57] (03CR) 10Ottomata: [V: 032 C: 032] Fix comment about EventStreams active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/451081 (https://phabricator.wikimedia.org/T199433) (owner: 10Ottomata) [13:58:12] (03CR) 10BBlack: [C: 032] cpNNNN: remove all jessie installer settings [puppet] - 10https://gerrit.wikimedia.org/r/451324 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [13:58:23] (03PS2) 10BBlack: cpNNNN: remove all jessie installer settings [puppet] - 10https://gerrit.wikimedia.org/r/451324 (https://phabricator.wikimedia.org/T200445) [13:58:35] (03CR) 10BBlack: [V: 032 C: 032] cpNNNN: remove all jessie 
installer settings [puppet] - 10https://gerrit.wikimedia.org/r/451324 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [13:58:41] (03CR) 10Ema: [C: 031] Spare out the unused eqiad caches for future decom [puppet] - 10https://gerrit.wikimedia.org/r/451325 (owner: 10BBlack) [14:12:47] (03CR) 10Mohab Fekry: "> Patch Set 1:" [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 (owner: 10Mohab Fekry) [14:12:55] (03Abandoned) 10Mohab Fekry: Extend varnishkafka config properties with the log.connection.close librdkafka config property [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/451281 (owner: 10Mohab Fekry) [14:21:12] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2027.codfw.wmnet'] ``` The log... [14:23:32] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/451253 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:23:52] (03CR) 10Bstorm: [C: 031] "Since the index maintainer script is also using maintainviews as the user (and that has super), I think this should be fine. 
After findin" [puppet] - 10https://gerrit.wikimedia.org/r/451307 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:24:55] (03PS2) 10BBlack: Spare out the unused eqiad caches for future decom [puppet] - 10https://gerrit.wikimedia.org/r/451325 [14:24:57] (03PS2) 10BBlack: [WIP] Move cache::canary from cp1008 to cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/451326 [14:24:59] (03PS1) 10BBlack: remove old eqiad caches from data lists [puppet] - 10https://gerrit.wikimedia.org/r/451328 [14:26:44] (03PS2) 10Jcrespo: mariadb: Make wikireplicas obey the current mariadb read_only config [puppet] - 10https://gerrit.wikimedia.org/r/451307 (https://phabricator.wikimedia.org/T172489) [14:26:57] (03CR) 10Gehel: "Comments inline." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [14:28:38] (03CR) 10Jcrespo: [C: 032] mariadb: Make wikireplicas obey the current mariadb read_only config [puppet] - 10https://gerrit.wikimedia.org/r/451307 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:31:02] (03PS3) 10BBlack: [WIP] Move cache::canary from cp1008 to cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/451326 [14:34:21] (03CR) 10Imarlier: "> I can see this url in Google cache, but for myself, anything on" [puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [14:45:05] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [14:45:05] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 68 ESP OK [14:45:06] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 56 ESP OK [14:45:06] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 56 ESP OK [14:45:06] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [14:45:06] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 56 ESP OK [14:45:07] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 56 ESP OK [14:45:07] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 56 ESP OK [14:45:07] RECOVERY - IPsec on 
cp3047 is OK: Strongswan OK - 56 ESP OK [14:45:15] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 56 ESP OK [14:45:15] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [14:45:15] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 56 ESP OK [14:45:16] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [14:45:16] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 56 ESP OK [14:45:16] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 56 ESP OK [14:45:16] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [14:45:25] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 56 ESP OK [14:45:25] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 56 ESP OK [14:45:25] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 56 ESP OK [14:45:25] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 56 ESP OK [14:45:26] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [14:45:35] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 56 ESP OK [14:45:35] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 56 ESP OK [14:45:35] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 56 ESP OK [14:45:35] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [14:45:36] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 68 ESP OK [14:45:36] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 68 ESP OK [14:45:36] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [14:45:36] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [14:45:36] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [14:45:37] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [14:45:46] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [14:45:46] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 56 ESP OK [14:45:46] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [14:45:46] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [14:45:51] orilly [14:46:05] this is me managing to convince cp2022 to boot from disk ^ [14:47:46] sure is chatty 
;-0 [14:47:49] er, ;-) [14:47:54] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Nemo_bis) >>! In T52864#4488149, @MarcoAurelio wrote: > Hi. Any status updates here? Thanks. If you are in a hurry to switch to mailman 3, maybe yo... [14:48:25] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 68 ESP OK [14:48:25] apergos: if only you knew how much I had to chat with the host to make it boot properly! [14:48:26] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 56 ESP OK [14:48:26] PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:29] (03CR) 10Volans: Doc: fix library examples (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [14:48:35] heh [14:49:16] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 74132 bytes in 0.430 second response time [14:49:55] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 56 ESP OK [14:50:28] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2027.codfw.wmnet'] ``` and were **ALL** successful. [14:51:05] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2022 is CRITICAL: connect to address 10.192.48.26 and port 3126: Connection refused [14:51:06] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 27 seconds ago with 3 failures. 
Failed resources (up to 3 shown): Package[varnishkafka],Service[varnishmtail],Package[mtail],Exec[retry-load-new-vcl-file] [14:51:36] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 56 ESP OK [14:52:08] noisy ema is noisy :P [14:52:35] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 56 ESP OK [14:53:19] (03PS3) 10Herron: prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) [14:54:06] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp2022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time [14:55:12] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) [14:55:39] !log switching over, stopping, upgrading and restarting labsdb1011 [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:06] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:17] (03CR) 10Gehel: Doc: fix library examples (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [14:57:25] (03CR) 10Filippo Giunchedi: trafficserver: initial module/profile/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [14:57:38] 10Operations, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10Papaul) @ayounsi all SFP+-10G-LR are in place . [14:57:52] 10Operations, 10Thumbor, 10Patch-For-Review: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (10Urbanecm) 05Open>03Resolved a:03fgiunchedi Works, thanks. 
[15:01:33] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (10Cmjohnson) [15:02:38] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (10Cmjohnson) added both wmf7426 and wmf7433 to the google tracking sheet [15:02:49] (03PS2) 10Volans: Doc: fix library examples [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 [15:03:01] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 56 ESP OK [15:03:10] 10Operations, 10ops-eqiad: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Cmjohnson) [15:04:10] (03CR) 10Volans: "inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [15:04:10] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10Cmjohnson) [15:04:11] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10Cmjohnson) [15:04:46] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) [15:04:50] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 56 ESP OK [15:05:06] 10Operations, 10ops-eqiad, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Cmjohnson) [15:06:19] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) [15:06:35] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) Added both wmf7426 and wmf7433 to the tracking sheet [15:06:51] (03CR) 10Gehel: [C: 031] 
"LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:08:02] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [15:08:04] (03CR) 10Volans: [C: 032] Fix docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:08:53] (03Merged) 10jenkins-bot: Fix docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/450987 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:09:08] (03CR) 10Gehel: "minor comments inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (owner: 10Volans) [15:13:45] Am I right here to ask for removal of an old MediaWiki installation? https://phabricator.wikimedia.org/T166012 [15:13:47] (03PS2) 10BBlack: remove old eqiad caches from data lists [puppet] - 10https://gerrit.wikimedia.org/r/451328 [15:13:59] (03CR) 10BBlack: [C: 032] remove old eqiad caches from data lists [puppet] - 10https://gerrit.wikimedia.org/r/451328 (owner: 10BBlack) [15:14:16] (03PS3) 10BBlack: Spare out the unused eqiad caches for future decom [puppet] - 10https://gerrit.wikimedia.org/r/451325 [15:14:24] (03CR) 10BBlack: [C: 032] Spare out the unused eqiad caches for future decom [puppet] - 10https://gerrit.wikimedia.org/r/451325 (owner: 10BBlack) [15:16:18] (03PS8) 10Vgutierrez: Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [15:18:36] 10Operations, 10Traffic, 10netops: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10ayounsi) Network devices updated. 
[15:18:38] (03CR) 10Vgutierrez: [C: 04-1] "@Alex please clarify the yaml VS pson issue in the puppet file API metadata response" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [15:20:02] 10Operations, 10Traffic, 10netops: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10Vgutierrez) Awesome, thanks! [15:20:37] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) [15:20:39] 10Operations, 10Traffic, 10netops: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10Vgutierrez) 05Open>03Resolved [15:24:53] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) [15:26:13] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) Disk wipe in progress [15:26:34] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) [15:28:18] 10Puppet, 10Toolforge, 10Goal: Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Bstorm) [15:28:57] 10Puppet, 10Toolforge, 10Goal: Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Bstorm) I found that the custom complexes were at least partially done, but they weren't used correctly, so they probably have never been tested. 
[15:33:06] (03CR) 10Volans: [C: 032] Doc: fix library examples [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [15:33:18] (03PS10) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [15:33:22] (03CR) 10Volans: [C: 032] Tests: refactor get_fixture_path() [software/spicerack] - 10https://gerrit.wikimedia.org/r/451253 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:34:17] (03Merged) 10jenkins-bot: Tests: refactor get_fixture_path() [software/spicerack] - 10https://gerrit.wikimedia.org/r/451253 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:34:35] (03CR) 10Ema: trafficserver: initial module/profile/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [15:36:07] (03Merged) 10jenkins-bot: Doc: fix library examples [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [15:37:30] (03CR) 10jenkins-bot: Doc: fix library examples [software/cumin] - 10https://gerrit.wikimedia.org/r/451321 (owner: 10Volans) [15:40:59] 10Operations, 10ops-eqiad: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Vgutierrez) [15:41:36] 10Operations, 10ops-eqiad, 10Traffic: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Vgutierrez) [15:45:34] !log reimaging cp1046-9 [15:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:15] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) ``` show interfaces ge-6/0/12 Physical interface: ge-6/0/12, Administratively down, Physical link is Down Interface index: 1212, SNMP ifIndex: 76... 
[15:47:32] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) [15:48:00] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) I've added java threads and heap bytes to the dashboard, looks like there's a thread leak on 2 out of 3 hosts (unclear though if that's involved in packet loss) {F24690529} [15:48:17] (03PS1) 10Vgutierrez: site: Reinstall chromium and hydrogen as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/451360 (https://phabricator.wikimedia.org/T201522) [15:49:25] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) Took thread dumps from 1008 and 1007: https://phabricator.wikimedia.org/P7437 and https://phabricator.wikimedia.org/P7438 [15:51:40] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:51:40] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:51:50] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:51:50] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:51:59] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:04] heh [15:52:09] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:09] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:10] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, 
cp1049_v4, cp1049_v6 [15:52:10] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:11] right, puppet-disables .... [15:52:18] ahh back again icinga [15:52:19] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:29] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:29] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:29] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:30] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:39] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:39] ignore that, sorry for the spam, working on a silence [15:52:39] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:39] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:52:40] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6 [15:53:06] we have some overlapping uses for puppet disables and needs for puppet runs to update ipsec lists [15:53:12] etoomuchgoingonatonce [15:54:21] !log downtimed all ipsec checks :P [15:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:19] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 34 ESP OK [15:55:46] as long as we know what it is, we can live with the noise [15:56:00] (though someday it would be nice to have well-behaved icinga, it's not urgent) [15:56:27] (03CR) 
10Vgutierrez: [C: 032] site: Reinstall chromium and hydrogen as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/451360 (https://phabricator.wikimedia.org/T201522) (owner: 10Vgutierrez) [15:58:55] !log reimaging cp1050.eqiad.wmnet cp1052.eqiad.wmnet cp1053.eqiad.wmnet cp1054.eqiad.wmnet cp1055.eqiad.wmnet cp1059.eqiad.wmnet [15:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:48] checking out for awhile, back later (maybe much later, depends...) [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:48] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` hydrogen.wikimedia.org ``` The log can be found in `/var... [16:01:59] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` chromium.wikimedia.org ``` The log can be found in `/var... 
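[Editor's aside] The "downtimed all ipsec checks" entry above silences the IPsec alert spam while the cp hosts are reimaged. As a hedged sketch only (the actual tooling used is not shown in the log): against a classic Icinga 1.x setup, per-service downtimes can be scheduled by writing `SCHEDULE_SVC_DOWNTIME` external commands to the command pipe. Host names, the command-file path, author, and comment below are illustrative assumptions, not taken from the log.

```python
import time

# Typical Icinga 1.x external command pipe location (an assumption here).
CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

def downtime_service(host, service, minutes, author, comment, now=None):
    """Format one SCHEDULE_SVC_DOWNTIME external command line.

    Fields per the Icinga 1.x external command format:
    [ts] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger_id;duration;author;comment
    fixed=1 means a fixed-window downtime; trigger_id=0 means untriggered.
    """
    now = int(time.time() if now is None else now)
    end = now + minutes * 60
    return ("[%d] SCHEDULE_SVC_DOWNTIME;%s;%s;%d;%d;1;0;%d;%s;%s"
            % (now, host, service, now, end, minutes * 60, author, comment))

# Illustrative host list; in practice the full cp fleet would be iterated.
lines = [downtime_service(h, "IPsec", 120, "ops", "cp reimages in progress")
         for h in ("cp3038", "cp2026", "cp4026")]
# To submit for real, the lines would be appended to the command pipe:
# with open(CMD_FILE, "w") as f:
#     f.write("\n".join(lines) + "\n")
```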
[16:03:29] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2064 [dns] - 10https://gerrit.wikimedia.org/r/451362 [16:06:15] (03CR) 10Marostegui: [C: 032] DNS: Remove mgmt DNS for db2064 [dns] - 10https://gerrit.wikimedia.org/r/451362 (owner: 10Papaul) [16:07:48] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) [16:08:08] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) 05Open>03Resolved This is complete resolving it. [16:08:13] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Marostegui) https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451362/ merged and deployed [16:11:12] (03PS11) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [16:18:36] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 34 ESP OK [16:20:22] !log reimaging cp1060, cp1062-8 [16:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:47] Anyone SWATing? [16:32:57] There's a UBN task I'll deploy if not. [16:33:06] looks like not [16:33:34] OK, I have the conch. 
[16:38:55] (03PS4) 10Herron: prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) [16:41:06] (03CR) 10Herron: [C: 032] prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [16:42:09] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [16:43:28] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [16:48:21] Eurgh. Keeps randomly failing in CI. [16:53:05] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) I've worked with @awight on a document describing JADE's requirements and possible implementation... [16:53:08] !log reimagine cp1071-4, cp1099 [16:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] I guess "reimagine" fits the bill as well heh [16:53:45] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [16:53:47] * James_F grins. 
[16:54:43] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [16:57:07] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [16:57:39] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [16:59:46] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 34 ESP OK [17:00:23] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.15/includes/logging/LogFormatter.php: SWAT T185049 Unbreak views of invalid title log entries, wmf.15 (duration: 01m 01s) [17:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:28] T185049: Unable to open edit page, or view Special:Log, for certain pages ("MWException: Expected title, got null" from LogFormatter.php) - https://phabricator.wikimedia.org/T185049 [17:01:56] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.16/includes/logging/LogFormatter.php: SWAT T185049 Unbreak views of invalid title log entries, wmf.16 (duration: 00m 58s) [17:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:34] Deployment done. [17:05:32] * Niharika pats James_F on the back [17:05:32] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['chromium.wikimedia.org'] ``` and were **ALL** successful. 
[17:05:51] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['hydrogen.wikimedia.org'] ``` and were **ALL** successful. [17:08:15] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2124 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:09:24] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > be curated (patrolled, deleted, etc.) within MediaWiki. The must important question is: how i... [17:19:37] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Dzahn) alert1001 because the puppet role it uses is called "alerting_host", for the reasons Jaime mentioned above and to avoid a specific software name that might change while... [17:24:00] (03PS4) 10Krinkle: Set $wgPropagateErrors to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: 10Gergő Tisza) [17:24:05] (03CR) 10Krinkle: [C: 031] "Rebased." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: 10Gergő Tisza) [17:30:38] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:31:28] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:32:17] (03PS1) 10Volans: Doc: fix typo in parameter type [software/cumin] - 10https://gerrit.wikimedia.org/r/451379 [17:32:28] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:32:29] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:35:21] (03CR) 10Volans: [C: 032] Doc: fix typo in parameter type [software/cumin] - 10https://gerrit.wikimedia.org/r/451379 (owner: 10Volans) [17:37:24] (03CR) 10Dzahn: "i don't see anything using this. what i do see is that "contint::packages::androidsdk" is used and also installs the xpra and xorg package" [puppet] - 10https://gerrit.wikimedia.org/r/448788 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:37:30] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4489021, @daniel wrote: >> be curated (patrolled, deleted, etc.) within MediaWiki.... 
[17:38:00] (03Merged) 10jenkins-bot: Doc: fix typo in parameter type [software/cumin] - 10https://gerrit.wikimedia.org/r/451379 (owner: 10Volans) [17:38:03] (03PS2) 10Dzahn: xdummy: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448788 (https://phabricator.wikimedia.org/T194724) [17:39:01] (03CR) 10Dzahn: "@sniedzielski @hashar "contint::packages::androidsdk" seems to install Xdummy and is in use but this module xdummy does not seem to be in " [puppet] - 10https://gerrit.wikimedia.org/r/448788 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:39:17] (03CR) 10jenkins-bot: Doc: fix typo in parameter type [software/cumin] - 10https://gerrit.wikimedia.org/r/451379 (owner: 10Volans) [17:39:49] (03CR) 10Dzahn: [C: 032] xdummy: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448788 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:41:18] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.003213 https://grafana.wikimedia.org/dashboard/db/logstash [17:42:40] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) Can you confirm that you want the name to be changed to alert1001? 
thanks [17:46:30] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10Cmjohnson) [17:47:18] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:47:19] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.281 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:48:18] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:48:48] PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:52] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) asw2-a-eqiad now looks like the 3rd diagram (all leafs have at least 1 link to a spine). fpc4 is connected to fpc2/fpc6/fpc7 (removed fpc3 links) fpc5 is connected to fpc2/fpc3/... [17:49:49] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH,POST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:49:59] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,create,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:50:09] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PATCH,POST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:50:18] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2891 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:50:28] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:50:38] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST,PATCH,POST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:50:56] 10Operations, 10monitoring, 10netops: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (10ayounsi) [17:50:58] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,create,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:51:19] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:28] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [17:51:28] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:51:28] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,create,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:51:38] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [17:51:39] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:39] PROBLEM - logstash log4j TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 4560: Connection refused 
[17:51:48] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:51:49] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [17:51:49] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [17:52:07] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Disabled the VC link between fpc4 and fpc5 to reduce the density of links (cf. T201145#4486602). 
[17:52:08] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:52:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received [17:52:18] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:53:00] ^ logstash1008 is me, taking a long time to restart [17:53:18] PROBLEM - HTTP on install1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:19] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:53:19] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:40] the kubernetes-api ones seem to be new checks. 
install1002 i am taking a look [17:53:58] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received [17:53:58] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:53:58] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received [17:54:28] PROBLEM - mailman archives on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:39] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:54:48] RECOVERY - logstash log4j TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4560 [17:54:58] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.011 seconds response time (No ANSWER SECTION found) [17:54:58] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [17:54:59] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [17:55:09] RECOVERY - HTTP on install1002 is OK: HTTP OK: HTTP/1.1 302 Moved Temporarily - 381 bytes in 4.252 second response time [17:55:17] !log performing rolling restart of logstash instances T200960 [17:55:18] RECOVERY - mailman 
archives on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 74815 bytes in 0.874 second response time [17:55:58] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:55:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:05] T200960: Logstash packet loss - https://phabricator.wikimedia.org/T200960 [17:56:07] install1002,fermium etc.. are ok. scb recovered [17:56:08] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received [17:56:19] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:56:39] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [17:56:49] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [17:56:58] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [17:56:58] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [17:56:59] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [17:56:59] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:57:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [17:57:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:57:18] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: 
All endpoints are healthy [17:57:29] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file n…xistent title) timed out before a response was received [17:58:08] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [umostread] [17:58:19] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:58:39] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,pull_image} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:59:09] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:59:39] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:00:08] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received [18:00:08] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:00:08] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:00:09] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:00:18] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:00:28] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:00:29] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:00:39] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:49] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:00:58] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:00:58] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:00:58] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:00:59] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.010 seconds response time (No ANSWER SECTION found) [18:01:08] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:01:08] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:01:09] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:01:09] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:01:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:01:28] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.071 second response time [18:01:28] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.090 second response time [18:01:29] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: Status line output matched HTTP/1.1 401 - 593 bytes in 0.002 second response time [18:01:29] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [18:01:29] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:01:39] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [18:01:48] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74166 bytes in 0.110 second response time [18:02:08] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [18:02:58] PROBLEM - Host 
cp1088 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:59] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.015 seconds response time (No ANSWER SECTION found) [18:03:19] PROBLEM - Host cp1090 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:08] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:04:09] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:04:18] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:04:33] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10MarcoAurelio) I am not in any hurry. I am just requesting an status update. [18:04:58] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:05:10] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:05:30] 10Operations, 10Data-Services, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Bstorm) @Imarlier do you need sudo or just login access to the server? Also, should everyone in the perf-team group be included in that? [18:08:19] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:08:28] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:08:28] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1088_v4, cp1088_v6, cp1090_v4, cp1090_v6 [18:08:39] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:08:41] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10Cmjohnson) Received the new cables and swapped fpc1-fpc3 [18:08:58] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1088_v4, cp1088_v6, cp1090_v4, cp1090_v6 [18:08:59] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1088_v4, cp1088_v6, cp1090_v4, cp1090_v6 [18:09:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:09:59] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:18] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:10:18] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:10:29] PROBLEM - HTTP on 
releases1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:10:48] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:11:07] (03PS1) 10Arturo Borrero Gonzalez: dumps: give access to perf-team [puppet] - 10https://gerrit.wikimedia.org/r/451394 (https://phabricator.wikimedia.org/T201350) [18:11:20] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:49] PROBLEM - puppet last run on dns4002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:12:09] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.011 seconds response time (No ANSWER SECTION found) [18:12:26] PROBLEM - Corp OIT LDAP Mirror on dubnium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:28] RECOVERY - HTTP on releases1001 is OK: HTTP OK: HTTP/1.1 200 OK - 18421 bytes in 1.032 second response time [18:12:28] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) [18:12:28] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:12:28] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:12:58] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:58] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:08] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero 
alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:13:19] PROBLEM - puppet last run on cloudnet1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:13:30] I got a page for corp oit ldap mirror (not for anything else though) [18:13:39] PROBLEM - zotero on sca1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:48] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:14:30] RECOVERY - Corp OIT LDAP Mirror on dubnium is OK: LDAP OK - 0.218 seconds response time [18:14:31] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:14:31] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:15:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:15:42] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:15:50] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:16:31] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.012 seconds response time (No ANSWER SECTION found) [18:16:41] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 11%, RTA = 4353.84 ms [18:16:41] RECOVERY - zotero on sca1004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 1.036 second response time [18:16:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:17:00] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 16%, RTA = 3.94 ms [18:17:11] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,pull_image} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:17:41] PROBLEM - citoid endpoints 
health on scb1002 is CRITICAL: /api (Zotero alive) timed out before a response was received [18:17:41] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:18:50] RECOVERY - puppet last run on mw1337 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:18:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [18:18:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:19:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received [18:19:20] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [18:19:20] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:19:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:19:51] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:20:01] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [18:20:10] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:20:20] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:31] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:20:40] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:20:40] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.014 seconds response time (No ANSWER SECTION found) [18:21:22] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [18:21:31] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [18:21:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:21:41] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:21:50] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:01] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:22:11] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 4.916 second response time [18:22:20] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 624 bytes in 9.462 second response time [18:22:20] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file n…xistent title) timed out before a response was received [18:22:25] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@f0246f7]: Only rerender mobile-sections for wikipedia T201103 [18:22:41] PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:50] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [18:23:02] PROBLEM - Corp OIT LDAP Mirror on dubnium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:13] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:23:33] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:23:33] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:23:43] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74167 bytes in 0.591 second response time [18:23:44] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15501 bytes in 0.777 second response time [18:23:51] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@f0246f7]: Only rerender mobile-sections for wikipedia T201103 (duration: 01m 29s) [18:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:54] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:24:00] RECOVERY - Corp OIT LDAP Mirror on dubnium is OK: LDAP OK - 0.009 seconds response time [18:24:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:24:14] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:25:33] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:33] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:44] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:44] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:25:53] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:26:23] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.063 second response time [18:26:23] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.093 second response 
time [18:26:53] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:27:03] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:53] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: Status line output matched HTTP/1.1 401 - 593 bytes in 8.870 second response time [18:28:53] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:14] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:29:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:29:23] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:29:33] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:29:33] 
PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:29:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:29:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:29:35] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:53] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:29:54] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:54] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 2 minutes ago with 15 failures. Failed resources (up to 3 shown) [18:29:59] (03CR) 10Bstorm: [C: 032] "This does what we want (checked the compiler)." 
[puppet] - 10https://gerrit.wikimedia.org/r/451394 (https://phabricator.wikimedia.org/T201350) (owner: 10Arturo Borrero Gonzalez) [18:30:09] (03PS2) 10Bstorm: dumps: give access to perf-team [puppet] - 10https://gerrit.wikimedia.org/r/451394 (https://phabricator.wikimedia.org/T201350) (owner: 10Arturo Borrero Gonzalez) [18:30:13] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.011 seconds response time (No ANSWER SECTION found) [18:30:24] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:30:24] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 1.271 second response time [18:30:24] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 1.275 second response time [18:30:24] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:30:24] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:30:25] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:30:25] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:31:24] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 36 failures. Last run 3 minutes ago with 36 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/set-group-write2],File[/usr/local/share/man/man1],File[/usr/share/diamond/collectors/Nutcracker/Nutcracker.py],File[/etc/apache2/conf-available/00-defaults.conf] [18:31:28] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Bstorm) In that case I'll merge the patch we figured was probably the right answer here. 
Since the web function can move between the two servers for failover, we... [18:31:36] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:32:04] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Bstorm) It's 1006 right now. [18:32:13] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:32:14] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled],File[/usr/local/bin/prometheus-puppet-agent-stats] [18:32:34] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:33:03] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received [18:33:23] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:33:23] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:33:34] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received [18:33:34] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:34] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:43] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:33:43] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:33:43] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:33:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received [18:33:44] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:33:44] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:33:54] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received [18:34:05] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [18:34:35] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:34:43] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:34:43] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:34:54] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/ppchelko],File[/home/catrope],File[/home/bsitzmann] [18:35:14] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.012 seconds response time (No ANSWER SECTION found) [18:35:14] PROBLEM - Host scb1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:23] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:35:33] PROBLEM - Exim SMTP on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:33] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [18:35:34] RECOVERY - cxserver endpoints health on scb2002 is OK: All endpoints are healthy [18:35:34] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:35:34] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:44] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:48] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Cmjohnson) @Marostegui swapped the disk with a new one please resolve once raid rebuilds [18:36:14] PROBLEM - Host mw1305 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:23] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:23] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:23] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:26] Wikidata search is broken [18:36:34] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:34] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 37 failures. Last run 6 minutes ago with 37 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/rsyslog.d] [18:36:43] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:36:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:36:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:36:44] PROBLEM - Host elastic1031 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:53] PROBLEM - ElasticSearch health check for shards on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.90:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.90, port=9200): Read timed out. (read timeout=4) [18:37:03] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:37:25] RECOVERY - Exim SMTP on fermium is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Sep 2018 12:40:00 PM UTC. 
[18:37:33] RECOVERY - Host mw1305 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:37:33] RECOVERY - Host scb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:37:43] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1621 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:37:43] RECOVERY - Host elastic1031 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:37:44] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [18:37:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:38:01] PROBLEM - MariaDB Slave IO: s6 on db1085 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1061.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1061.eqiad.wmnet (110 Connection timed out) [18:38:05] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Bstorm) Ok, you are good-to-go on access. The dumps are served out of /srv/dumps/xmldatadumps/public to the web. This is also an NFS server, since they both ha... 
[18:38:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:38:13] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:38:14] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:38:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:38:14] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:38:15] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [18:38:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:38:15] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:16] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [18:38:23] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74145 bytes in 1.543 second response time [18:38:23] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:33] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:33] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:38:34] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:38:34] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:38:43] RECOVERY - puppet last run on cloudnet1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:38:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:38:43] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.069 second response time [18:38:43] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.110 second response time [18:38:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:38:44] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [18:38:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:38:44] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:38:44] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:38:54] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:38:54] RECOVERY - ElasticSearch health check for shards on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, 
number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active [18:38:54] alizing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [18:38:54] PROBLEM - puppet last run on kafka-jumbo1001 is CRITICAL: CRITICAL: Puppet has 34 failures. Last run 4 minutes ago with 34 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/rsyslog.d] [18:39:03] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:39:03] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:39:13] RECOVERY - MariaDB Slave IO: s6 on db1085 is OK: OK slave_io_state Slave_IO_Running: Yes [18:39:14] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [18:39:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [18:39:23] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:40:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:40:13] PROBLEM - puppet last run on wtp1031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:40:23] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:40:44] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:40:54] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:13] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:24] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:25] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:41:25] PROBLEM - Host db1104 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:26] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [18:41:33] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:41:33] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:41:34] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:41:54] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:41:54] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:42:03] PROBLEM - Corp OIT LDAP Mirror on dubnium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:04] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:42:04] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:42:04] RECOVERY - Host db1104 is UP: PING WARNING - Packet loss = 54%, RTA = 0.28 ms [18:42:04] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:13] RECOVERY - puppet last run on dns4002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:42:23] PROBLEM - cxserver 
endpoints health on scb1003 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:42:23] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:42:23] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:42:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:42:23] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from G [18:42:23] L: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [18:42:23] PROBLEM - ElasticSearch health check for shards on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.37:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.37, port=9200): Read timed out. 
(read timeout=4) [18:42:33] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:42:34] PROBLEM - Host mw1338 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:34] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [18:42:43] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:42:44] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:42:44] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [18:42:53] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:03] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:43:14] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:14] RECOVERY - Packet loss ratio for UDP on logstash1007 is 
OK: (C)0.1 ge (W)0.05 ge 0.04253 https://grafana.wikimedia.org/dashboard/db/logstash [18:43:15] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [18:43:23] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) timed out before a response was received [18:43:24] PROBLEM - Host kafka-jumbo1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:25] PROBLEM - Host ores1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:25] PROBLEM - Host elastic1036 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:25] PROBLEM - mailman archives on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:33] PROBLEM - ElasticSearch health check for shards on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.90:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.90, port=9200): Read timed out. 
(read timeout=4) [18:43:34] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:34] RECOVERY - cxserver endpoints health on scb2004 is OK: All endpoints are healthy [18:43:43] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:43:44] PROBLEM - Host restbase1016 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:44] PROBLEM - Host mw1345 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:45] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [18:43:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:43:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [18:43:54] PROBLEM - Host cp1079 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:54] PROBLEM - Host cp1058 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:54] PROBLEM - Host elastic1050 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:55] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [18:44:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the 
most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/references/{title}{/revision}{/tid} (G [18:44:03] test page) timed out before a response was received [18:44:03] RECOVERY - Host mw1345 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:44:10] RECOVERY - Corp OIT LDAP Mirror on dubnium is OK: LDAP OK - 1.657 seconds response time [18:44:12] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:44:12] PROBLEM - Host cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:44:12] PROBLEM - Host cr2-knams is DOWN: PING CRITICAL - Packet loss = 100% [18:44:23] RECOVERY - Host cp1079 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:44:24] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [18:44:24] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [18:44:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:44:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:44:33] RECOVERY - cxserver endpoints health on scb1003 is OK: All endpoints are healthy [18:44:34] RECOVERY - mailman archives on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 74813 bytes in 2.221 second response time [18:44:50] PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:44:51] RECOVERY - Host ores1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:44:51] PROBLEM - Host mw1341 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:56] Ok, I guess this IS server problem and not on my side. I cannot access Wikipedia :) [18:45:10] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [18:45:10] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [18:45:10] PROBLEM - SSH on ununpentium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:45:12] PROBLEM - Host kafka1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:12] PROBLEM - Host db1101 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:12] PROBLEM - Host mw1301 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:20] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [18:45:20] there's some network things going on Urbanecm [18:45:20] PROBLEM - Host cp1081 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:21] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:45:30] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [18:45:30] Urbanecm: I cannot either, definitelly server [18:45:35] Thank you apergos [18:45:41] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: 
PING CRITICAL - Packet loss = 100% [18:45:41] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:45:43] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:43] PROBLEM - Exim SMTP on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:45:43] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:45:46] It started with just elasticsearch failing, but everything seems broken now [18:45:55] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:46:15] RECOVERY - Host mw1341 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:46:16] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:46:17] PROBLEM - Host proton1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:17] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:46:17] PROBLEM - Host eeden.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:46:25] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} ( [18:46:25] ead articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was 
received [18:46:26] I just got “If you report this error to the Wikimedia System Administrators, please include the details below. [18:46:26] Request from 94.197.120.73 via cp3032 cp3032, Varnish XID 246220592 [18:46:26] PROBLEM - Host mc1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:26] Error: 503, Backend fetch failed at Wed, 08 Aug 2018 18:45:04 GMT” [18:46:27] RECOVERY - SSH on ununpentium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [18:46:35] PROBLEM - Host restbase-dev1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:35] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{ [18:46:35] eve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:46:35] RECOVERY - Host cp1058 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:46:35] RECOVERY - Host elastic1036 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:46:36] RECOVERY - Host mc1027 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:46:45] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received [18:46:45] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:46:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:47:03] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2702 bytes in 0.346 second response time [18:47:03] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:47:05] PROBLEM - Host bast3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:05] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received [18:47:15] PROBLEM - Host cp3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:15] PROBLEM - Host cp3036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:16] PROBLEM - Host nescio.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:16] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:47:17] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:47:26] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:47:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:47:35] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:47:35] PROBLEM - Host lvs3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] PROBLEM - Host lvs3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] PROBLEM - Host lvs3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] PROBLEM - Host multatuli.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] PROBLEM - Host lvs3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] PROBLEM - Host maerlant.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:36] RECOVERY - Host elastic1050 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:47:36] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:45] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:47:45] RECOVERY - Host restbase-dev1005 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [18:47:46] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:47:55] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [18:47:55] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5708 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:47:56] 
PROBLEM - Host cp1085 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:56] RECOVERY - Host db1101 is UP: PING WARNING - Packet loss = 50%, RTA = 0.21 ms [18:47:57] RECOVERY - Host mw1338 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:48:05] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [18:48:05] RECOVERY - Host kafka-jumbo1004 is UP: PING OK - Packet loss = 0%, RTA = 5.51 ms [18:48:05] RECOVERY - Host kafka1003 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [18:48:05] RECOVERY - Host proton1002 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [18:48:05] RECOVERY - Host analytics1062 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [18:48:06] RECOVERY - Host cp1085 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [18:48:06] RECOVERY - Host mw1301 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:48:06] RECOVERY - Host cp1081 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [18:48:06] RECOVERY - Host restbase1016 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [18:48:07] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74177 bytes in 6.405 second response time [18:48:11] wtf [18:48:20] RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 58%, RTA = 83.63 ms [18:48:21] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [18:48:24] <_joe_> Chrissymad: we're having network troubles [18:48:27] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17288 bytes in 0.498 second response time [18:48:29] i see :P [18:48:31] We are investigating network issues [18:48:33] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.65 ms [18:48:34] RECOVERY - Exim SMTP on fermium is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Sep 2018 12:40:00 PM UTC.
[18:48:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:48:34] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [18:48:35] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.34 ms [18:48:39] all caps is never good :P [18:48:43] Yep, please don't panic [18:48:45] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:45] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:48:45] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:48:45] :) [18:48:45] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.066 second response time [18:48:46] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.046 second response time [18:48:46] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.47 ms [18:48:46] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.51 ms [18:48:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:48:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:48:55] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:48:55] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:48:55] if we can't edit, vandals cannot either :) [18:48:55] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:48:56] RECOVERY - ElasticSearch health check for shards on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.57446808534, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [18:48:57] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:48:58] RECOVERY - ElasticSearch health check for shards on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.57446808534, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [18:49:05] RECOVERY - Host cr2-knams is UP: PING OK - Packet loss = 0%, RTA = 84.54 ms [18:49:05] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.71 ms [18:49:05] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All
endpoints are healthy [18:49:06] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.3-3-gb047bdb891 (SSHD-CORE-1.6.0) (protocol 2.0) [18:49:06] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:11] RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.63 ms [18:49:12] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:49:12] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:12] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:49:12] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on 10.64.0.231 port 9042 [18:49:25] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:49:27] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:27] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:27] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:27] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:36] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:49:36] wow, this channel is enormously helpful. 
This is better than any status page [18:49:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:49:37] PROBLEM - Host cp1079 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:37] PROBLEM - puppet last run on ores1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:37] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP] [18:49:37] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/phaste],File[/root/.screenrc] [18:49:38] PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:45] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:49:46] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:49:55] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP] [18:49:55] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP] [18:49:55] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:55] PROBLEM - puppet last run on mw1324 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:57] PROBLEM - puppet last run on labvirt1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:57] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/swift/account.builder],File[/etc/swift/account.ring.gz],File[/etc/swift/container.builder],File[/etc/swift/container.ring.gz] [18:50:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:50:26] PROBLEM - puppet last run on ms-fe2006 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/swift/account.builder],File[/etc/swift/account.ring.gz],File[/etc/swift/container.builder],File[/etc/swift/container.ring.gz] [18:50:46] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:50:46] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:50:55] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:51:05] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) If you are going to store things on the dump web servers: You want f... [18:51:05] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:51:06] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:06] PROBLEM - puppet last run on db1115 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:07] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [18:51:07] RECOVERY - puppet last run on wtp1025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:51:15] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:26] RECOVERY - Host eeden.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.85 ms [18:51:36] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:45] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 3 minutes ago with 13 failures. Failed resources (up to 3 shown): File[/home/rush],File[/home/oblivian],File[/home/dzahn],File[/home/akosiaris] [18:51:56] PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:52:06] RECOVERY - Host lvs3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.04 ms [18:52:11] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui) @Cmjohnson it failed - can you pull the disk out and then back in? We have seen that happening before. Let's give it a second chance (I assume it is one of the new disks) `... [18:52:15] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 48 connecting: cp1077_v4, cp1077_v6, cp1079_v4, cp1079_v6 [18:52:16] RECOVERY - Host bast3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.00 ms [18:52:17] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:17] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:17] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:17] PROBLEM - puppet last run on kafka-jumbo1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:17] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:25] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:52:26] PROBLEM - puppet last run on mw1338 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:52:26] RECOVERY - Host cp3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.02 ms [18:52:26] RECOVERY - Host cp3047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.24 ms [18:52:26] RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.59 ms [18:52:26] RECOVERY - Host cp3037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.56 ms [18:52:26] RECOVERY - Host cp3030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.64 ms [18:52:27] RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.97 ms [18:52:27] RECOVERY - Host cp3036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.18 ms [18:52:36] RECOVERY - Host lvs3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.06 ms [18:52:36] RECOVERY - Host lvs3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.07 ms [18:52:36] RECOVERY - Host maerlant.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.12 ms [18:52:36] RECOVERY - Host multatuli.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.42 ms [18:52:36] RECOVERY - Host lvs3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.26 ms [18:52:36] RECOVERY - Host nescio.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.43 ms [18:52:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:52:45] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP] [18:52:46] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:53:05] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:53:05] PROBLEM - MegaRAID on db1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [18:53:06] ACKNOWLEDGEMENT - MegaRAID on db1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T201542 [18:53:06] PROBLEM - Varnishkafka Eventlogging Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All [18:53:07] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:53:08] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:53:08] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:53:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:53:08] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.8467 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:53:08] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:53:09] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [18:53:11] 10Operations, 10ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T201542 (10ops-monitoring-bot) [18:54:46] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [18:54:46] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui) [18:54:49] 10Operations, 10ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T201542 (10Marostegui) [18:54:56] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:54:56] PROBLEM - cache_text: Varnishkafka Webrequest Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [18:54:57] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [18:54:58] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:54:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:54:59] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:54:59] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:54:59] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1088_v4, cp1088_v6, cp1090_v4, cp1090_v6 [18:55:05] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [18:56:05] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:56:05] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.01611 https://grafana.wikimedia.org/dashboard/db/logstash [18:56:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:57:05] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:57:05] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:58:05] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:58:05] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [18:58:05] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:58:15] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [18:59:06] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [18:59:06] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:59:15] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:59:15] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:59:56] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:59:57] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:59:57] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:59:57] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures 
[18:59:57] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:57] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:59:57] RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:59:58] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:59:58] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:59] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:59:59] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:00] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:00] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:01] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train - Americas version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T1900). [19:01:05] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [19:01:05] !log Waiting to deploy the train until after I've established that network issues are resolved. 
[19:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:15] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 16 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:01:30] It looks like things are recovering. Please let me know when it's all-clear for deployment [19:02:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:02:06] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [19:02:07] RECOVERY - cache_text: Varnishkafka Webrequest Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [19:02:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:02:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [19:02:08] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:02:58] 10Operations, 10ops-eqiad, 10DBA: 
Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui) ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 2% in 5 Minutes. ``` [19:03:06] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:03:12] twentyafterfour: definitely still some ?? on our end, I'd hold for now [19:03:43] bblack: indeed, I intend to wait for an all-clear from an opsen. Thanks! [19:04:35] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:56] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:04:56] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:04:56] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:04:56] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:56] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:56] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:56] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:57] RECOVERY - puppet last run on mw1322 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:04:57] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:04:58] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
[19:04:58] RECOVERY - puppet last run on wtp1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:59] RECOVERY - puppet last run on mw1332 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:07:06] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:07:06] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:08:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:08:28] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Imarlier) Perfect -- I'm in! And I shall be exceedingly kind to these boxes :-) Thanks much. [19:09:15] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:09:57] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:09:57] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:09:57] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:10:15] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:10:15] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1583 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:11:36] !log force puppet run where it's still failed [19:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:14:16] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03872 https://grafana.wikimedia.org/dashboard/db/logstash [19:14:57] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:14:57] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:14:57] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:14:57] RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:14:57] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:14:57] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:16:57] (03PS1) 10BBlack: depool eqiad for front-edge traffic [dns] - 10https://gerrit.wikimedia.org/r/451401 [19:17:28] (03CR) 10BBlack: [C: 032] depool eqiad for front-edge traffic [dns] - 10https://gerrit.wikimedia.org/r/451401 (owner: 10BBlack) [19:18:08] !log depooling front-edge traffic from eqiad in DNS - https://gerrit.wikimedia.org/r/c/operations/dns/+/451401 [19:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:22] 10Operations, 
10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Nemo_bis) >>! In T199252#4489377, @ArielGlenn wrote: > You want files to go under... [19:19:31] !log completed forced puppet run where it was failed [19:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:35] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on kafka-jumbo1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on mw1324 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:56] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:57] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:57] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:19:58] RECOVERY - puppet last run on ms-fe2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:20:05] RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
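The logstash UDP packet-loss alerts earlier in the log compare a measured ratio against warning and critical thresholds: the PROBLEM fired at "0.1583 ge 0.1", and the RECOVERY printed the full chain "(C)0.1 ge (W)0.05 ge 0.03872". A minimal sketch of that classification logic, using the threshold values from those alert lines (the function itself is hypothetical, not the actual check plugin):

```python
def classify(loss_ratio, warn=0.05, crit=0.1):
    """Classify a UDP packet-loss ratio against warning/critical thresholds,
    mirroring the 'value ge threshold' comparison shown in the alerts."""
    if loss_ratio >= crit:
        return "CRITICAL"
    if loss_ratio >= warn:
        return "WARNING"
    return "OK"

print(classify(0.1583))   # ratio from the PROBLEM alert → CRITICAL
print(classify(0.03872))  # ratio from the RECOVERY alert → OK
```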
[19:27:23] (03PS1) 10Zoranzoki21: Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) [19:27:36] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [19:28:02] (03PS2) 10Zoranzoki21: Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) [19:35:10] !log fix interface description on asw-a-eqiad uplinks [19:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:57] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > That sounds great to us, but we've put quite a bit of time into planning how to prevent abuse a... 
[19:46:16] !log rebooting cp1084 to debug eth hardware [19:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:25] RECOVERY - Host cp1084 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [19:53:16] <_joe_> !log powercycling cp1079, network card problems [19:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:34] !log reboot cp1090 for network card [19:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:57] !log reboot cp1088 for network card [19:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:26] RECOVERY - Host cp1079 is UP: PING WARNING - Packet loss = 64%, RTA = 3.38 ms [19:57:53] <_joe_> !log powercycling cp1077, network card problems [19:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:46] RECOVERY - Host cp1090 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180808T2000). [20:00:06] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 34 ESP OK [20:00:15] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [20:00:42] Looks like ORES isn't ready for a deployment today.
[20:00:56] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 34 ESP OK [20:00:56] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 34 ESP OK [20:01:36] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [20:01:45] RECOVERY - Host cp1077 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [20:04:26] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:54] (03PS1) 10BBlack: turn on backend_warming for eqiad caches [puppet] - 10https://gerrit.wikimedia.org/r/451410 [20:05:17] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74152 bytes in 0.299 second response time [20:06:56] (03CR) 10Giuseppe Lavagetto: [C: 031] turn on backend_warming for eqiad caches [puppet] - 10https://gerrit.wikimedia.org/r/451410 (owner: 10BBlack) [20:08:32] (03CR) 10BBlack: [C: 032] turn on backend_warming for eqiad caches [puppet] - 10https://gerrit.wikimedia.org/r/451410 (owner: 10BBlack) [20:15:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:45] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:16:46] PROBLEM - Esams HTTP 5xx 
reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:16:46] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:16:46] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:16:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:16:47] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:16:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:57] (03Abandoned) 10Jdlrobson: Limit wgMathEnableWikibaseDataType to wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442738 (https://phabricator.wikimedia.org/T173949) (owner: 10Jdlrobson) [20:17:14] hi, im getting "Request from 94.197.121.27 via cp1089 cp1089, Varnish XID 1003454472 
[20:17:14] Error: 503, Backend fetch failed at Wed, 08 Aug 2018 20:16:26 GMT" [20:17:46] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:18:46] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:19:05] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:20:25] (03CR) 10Bstorm: "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/451181 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [20:21:56] paladox: should be ok now? [20:22:07] bblack yup, seems to work. Thanks. [20:22:46] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:22:46] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:22:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:22:55] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:23:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:23:05] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:23:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:25:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:26:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:29:55] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:32:49] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) The above changes led to a malfunction of asw2-a-eqiad starting at 17:45 UTC causing: ~35% packet loss to hosts in row A, this also impacted hosts on asw for traffic coming from... [20:32:55] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:32:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:32:55] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:33:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:36:35] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across 
our stack - https://phabricator.wikimedia.org/T201409 (10Catrope) >>! In T201409#4484469, @Krinkle wrote: > Added see also: {T193050} and {T147101} In addition to that, do w... [20:40:39] !log jforrester@deploy1001 Started deploy [mobileapps/deploy@4ef02e1]: Update mobileapps to 162ebcf [20:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:25] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [20:48:15] !log jforrester@deploy1001 Finished deploy [mobileapps/deploy@4ef02e1]: Update mobileapps to 162ebcf (duration: 07m 36s) [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:50] !log Updated mobileapps to 162ebcf in Production. [20:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:03] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451529 [20:55:05] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451529 (owner: 1020after4) [20:56:42] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451529 (owner: 1020after4) [20:58:20] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.16 refs T191062 [20:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:38] T191062: 1.32.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T191062 [20:59:18] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.16 refs T191062 (duration: 00m 57s) [20:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.16 refs 
T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451529 (owner: 1020after4) [21:04:35] !log 1.32.0-wmf.16 appears to be stable, no noticeable increase in errors logged. [21:04:35] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [21:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:47] er [21:05:35] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [21:06:56] I'm not even sure what that's meant to mean, but the graphs it references don't even seem self-consistent :P [21:07:35] heh, great [21:07:51] the graphs I normally stare at look sane though [21:08:10] it's possible whatever that other thing alerting is, might be confused by the eqiad edge depool, I don't know [21:10:48] yeah [21:11:16] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:11:16] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:11:32] it's being confused by eqiad depool I think (the alert briefly above), not real [21:11:35] https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1&from=now-3h&to=now [21:11:56] ^ there's a zoomed view of what it's looking at, recently. the big sweeps there are from shifting edge traffic from codfw->eqiad [21:12:05] I don't know why it picked nearly 2h after the fact to alert, but whatever [21:12:14] it's being confused by those gyrations [21:12:55] err, "shifting edge traffic from eqiad->codfw" above [21:20:41] Repeatedly getting... [21:20:45] MediaWiki internal error.
[21:20:46] Original exception: [W2tekQpAAEAAACF6VnQAAABI] 2018-08-08 21:20:17: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [21:20:46] Exception caught inside exception handler. [21:20:48] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [21:21:09] url? [21:21:23] greg-g: https://en.wikipedia.org/wiki/Special:AbuseFilter/920 [21:21:29] Trying to disable the filter [21:21:42] W2teygpAADgAAA7D0YkAAABN [21:21:49] W2te6gpAAEYAABq62f8AAAAJ [21:21:55] W2te8ApAICsAACM1ytgAAABE [21:21:59] etc... [21:22:17] [W2teygpAADgAAA7D0YkAAABN] /wiki/Special:AbuseFilter/920 ErrorException from line 1154 of /srv/mediawiki/php-1.32.0-wmf.15/extensions/AbuseFilter/includes/Views/AbuseFilterViewEdit.php: PHP Warning: Invalid operand type was used: loadRequest expects array(s) [21:22:30] and there's a whole blunch [21:22:33] *bunch [21:22:48] Query: INSERT INTO `abuse_filter_action` (afa_filter,afa_consequence,afa_parameters) VALUES ('920','throttle',NULL) [21:22:49] Function: AbuseFilter::doSaveFilter [21:22:51] Error: 1048 Column 'afa_parameters' cannot be null (10.64.32.64) [21:23:28] Cyberpower678: I have no idea, but you should file a bug against abusefilter [21:23:46] I unfortunately don't have the time right now. [21:24:09] But it's easily replicable.. [21:24:24] Just go to the filter and try to disable it. [21:24:58] bawolff: ^ [21:28:08] we have a big jump in cross-dc traffic in the eqiad-to-codfw direction since ~21:05 that's saturating a link [21:28:13] unknown cause so far [21:28:26] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational [21:28:44] could something in the recent changes have caused a bunch of new db replica or other sort of x-dc traffic?
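The DBQueryError pasted above boils down to a NOT NULL violation: AbuseFilter::doSaveFilter inserts NULL into abuse_filter_action.afa_parameters, and MySQL rejects it with error 1048. A minimal sketch reproducing the same class of failure in sqlite (schema deliberately simplified; only the column names are taken from the failing query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE abuse_filter_action (
        afa_filter TEXT,
        afa_consequence TEXT,
        afa_parameters TEXT NOT NULL  -- mirrors the MySQL 1048 constraint
    )
""")
try:
    conn.execute(
        "INSERT INTO abuse_filter_action (afa_filter, afa_consequence, afa_parameters) "
        "VALUES (?, ?, ?)",
        ("920", "throttle", None),  # NULL parameters, as in the failing query
    )
except sqlite3.IntegrityError as e:
    print(e)  # NOT NULL constraint failed: abuse_filter_action.afa_parameters
```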
[21:29:25] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [21:31:03] bblack: nothing that I'm aware of [21:31:59] yeah it's apparently not mediawiki related, at least not directly! [21:32:10] since we just deployed to group1 I wouldn't expect anything too huge [21:33:09] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:36:44] perhaps revert group0 to be on the safe side and exclude it? [21:39:55] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4489550, @daniel wrote: > My point was that "must be editable and watchable" prett... [21:40:29] no I don't think it's related to the changes [21:40:42] it's just an abusive client sucking up lots of bandwidth from cache_upload media [21:57:52] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > are you just saying that pages on a central wiki seems reasonable, or that you think judgments... [22:02:06] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [22:03:06] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [22:07:29] (03PS1) 10BBlack: cacheproxy: reduce fq flow limit 1Gbps -> 400Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 [22:08:09] that alert is still just noise.
I think whoever implemented it didn't think about the case where a core DC is depooled and has a highly-variable but very small amount of traffic on it (mostly monitoring/healthchecks) [22:18:16] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [22:19:54] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4490106, @daniel wrote: > How do you feel about having a public discussion on this... [22:20:16] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [22:24:33] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10Capt_Swing) Thanks @RobH and sorry for the error. I have generated two new production SSH keys, one for each of my machines. Pasting both... [22:24:45] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10Capt_Swing) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDIhnBk9gR9sJm05RY/VG3180t5MdZa0KsTa7CQDMdOwe9M5IkkfNoX4C4iB7dQg526USmhx2B81AK00P3xxyX9q... [22:26:30] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504 (10Dzahn) I fixed replication between netmon1002 and netmon2001. The check turned green now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon2001&se...
[22:26:33] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10Capt_Swing) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDKGJvLuk3OarwQQOhnWz7zeB4CZmhvSRNOyAwRJHfo5lRxVvyELCy2ZYGlHNNBFNcEd6FIBkHueX1N7SwBiwyg9... [22:28:06] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10Capt_Swing) Both of these keys are new. I've also removed the rejected keys from wikitech, and will make sure I don't add these ones ther... [22:32:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:34:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:41:24] (03PS2) 10Volans: Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) [22:41:24] (03PS1) 10Volans: Doc: uniform docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/451537 (https://phabricator.wikimedia.org/T199079) [22:41:24] (03PS1) 10Volans: Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) [22:49:38] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504 (10Dzahn) >>! In T185504#4145176, @ema wrote: > We've had the following Icinga `UNKNOWN` on netmon2001 This works now: @netmon2001:/etc/nagios/nrpe.d# cat check_postgres-rep-l... [23:04:33] no deployers around? 
[23:05:03] (03PS11) 10Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) [23:05:52] if we find anyone to deploy… i'd like to add another patch, for T201472. i'm still preparing the backport [23:05:53] T201472: List insertion by typing '#', '*' is broken - https://phabricator.wikimedia.org/T201472 [23:07:09] * Jhs gives it 3 more minutes before he needs to go to bed [23:10:51] I'll try again tomorrow :) [23:10:54] MaxSem: are you around? could you do the SWAT deployment? other deployers are not here, asleep, or idle for longer than you ;) [23:11:06] sorry, meeting [23:11:11] (03PS9) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [23:11:17] (03PS3) 10Krinkle: webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) [23:11:25] (with a lot of other deployers) [23:14:18] (03CR) 10Krinkle: "Thanks. Not 100% sure I got the template path right, but re-applied to beta and seems to work fine. Compiler diff at https://puppet-compil" [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [23:14:31] okay, that looks like it ends in 20 minutes? unless it's something not on your calendar. 
i'll wait then :) [23:19:10] (03PS3) 10Reedy: Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) (owner: 10Zoranzoki21) [23:19:19] (03CR) 10Reedy: [C: 032] Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) (owner: 10Zoranzoki21) [23:20:40] (03Merged) 10jenkins-bot: Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) (owner: 10Zoranzoki21) [23:21:26] (03CR) 10jenkins-bot: Set wgVariantArticlePath for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451403 (https://phabricator.wikimedia.org/T201545) (owner: 10Zoranzoki21) [23:22:31] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgVariantArticlePath for zhwikiversity (duration: 01m 05s) [23:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:05] RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [23:23:49] (03PS2) 10Reedy: Remove $wgUseImageResize as same as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449615 [23:23:53] (03CR) 10Reedy: [C: 032] Remove $wgUseImageResize as same as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449615 (owner: 10Reedy) [23:25:25] (03Merged) 10jenkins-bot: Remove $wgUseImageResize as same as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449615 (owner: 10Reedy) [23:27:06] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove old or same image config (duration: 00m 56s) [23:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:55] Reedy: oh, you're SWAT-deploying? you're the best [23:29:10] I'm just mashing a keyboard [23:31:06] finally.. 
i could not join because of the vandal protection and could not re-gain my nick due to netsplit.. i could "release" my nick but still not use it until now [23:32:19] (03PS2) 10Reedy: Convert GWToolset to extension.json etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392219 (https://phabricator.wikimedia.org/T87928) [23:35:07] (03PS3) 10Reedy: Convert GWToolset to extension.json etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392219 (https://phabricator.wikimedia.org/T87928) [23:36:35] (03CR) 10Reedy: [C: 032] Convert GWToolset to extension.json etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392219 (https://phabricator.wikimedia.org/T87928) (owner: 10Reedy) [23:37:01] (03CR) 10jenkins-bot: Remove $wgUseImageResize as same as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449615 (owner: 10Reedy) [23:37:52] (03Merged) 10jenkins-bot: Convert GWToolset to extension.json etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392219 (https://phabricator.wikimedia.org/T87928) (owner: 10Reedy) [23:38:19] Reedy: remind me how backports of commits to submodules work. do i need to submit a submodule update commit in mediawiki/extensions/VisualEditor? [23:38:28] or is it automatic? [23:38:41] i know it depends on the repo/branch but i don't know where to check how this particular repo/branch is set up [23:38:56] Oh.. [23:39:01] That's a submodule of a submodule isn't it [23:39:13] !log reedy@deploy1001 Synchronized wmf-config/extension-list: GWToolset to extension.json (duration: 00m 57s) [23:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:18] VisualEditor/VisualEditor is a submodule of mediawiki/extensions/VisualEditor [23:39:25] mediawiki/extensions/VisualEditor is a submodule of mediawiki/extensions, but i don't think that matters here [23:39:47] Jerkins hasn't merged it yet anyway [23:40:19] yeah. 
i guess we just wait and see
[23:40:58] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: extension registration config for GWToolset (duration: 00m 56s)
[23:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:49] Why isn't it merging
[23:45:41] (PS2) Reedy: Add proper collation for Albanian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/451185 (https://phabricator.wikimedia.org/T192709) (owner: Jon Harald Søby)
[23:45:52] (CR) Reedy: [C: 2] Add proper collation for Albanian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/451185 (https://phabricator.wikimedia.org/T192709) (owner: Jon Harald Søby)
[23:46:26] Reedy: hmm, there might not be a gate-and-submit job for that repo/branch? are you able to V+2 and submit it?
[23:46:39] There is for the repo
[23:46:43] Might not be for the branch
[23:46:58] Code-Review
[23:46:58] +2 Reedy
[23:46:58] Verified
[23:46:58] +2 jenkins-bot
[23:47:01] I can submit
[23:47:26] (Merged) jenkins-bot: Add proper collation for Albanian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/451185 (https://phabricator.wikimedia.org/T192709) (owner: Jon Harald Søby)
[23:48:32] Reedy: i don't see a commit in mw/ext/VisualEditor.
i'll create one
[23:48:49] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: sq* collation (duration: 00m 56s)
[23:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:50] Reedy: https://gerrit.wikimedia.org/r/451545
[23:50:55] (PS1) Dzahn: netbox: make the role usable on a stand-alone host again [puppet] - https://gerrit.wikimedia.org/r/451546
[23:52:23] (PS2) Reedy: Add correct sitename for satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400) (owner: Urbanecm)
[23:52:36] (PS3) Reedy: Add correct sitename for satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400) (owner: Urbanecm)
[23:52:41] (CR) Reedy: [C: 2] Add correct sitename for satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400) (owner: Urbanecm)
[23:52:46] (CR) jenkins-bot: Convert GWToolset to extension.json etc [mediawiki-config] - https://gerrit.wikimedia.org/r/392219 (https://phabricator.wikimedia.org/T87928) (owner: Reedy)
[23:52:48] (CR) jenkins-bot: Add proper collation for Albanian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/451185 (https://phabricator.wikimedia.org/T192709) (owner: Jon Harald Søby)
[23:54:03] (Merged) jenkins-bot: Add correct sitename for satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400) (owner: Urbanecm)
[23:55:00] Operations, netops, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi)
[23:55:53] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: satwiki sitename (duration: 01m 05s)
[23:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:15] (CR) Dzahn: [C: 2] "does not affect prod
(http://puppet-compiler.wmflabs.org/12022/) but let's us use netbox role on cloud VPS for testing.. without also inst" [puppet] - https://gerrit.wikimedia.org/r/451546 (owner: Dzahn)
[23:56:21] (PS1) Reedy: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/451547
[23:56:24] (CR) Reedy: [C: 2] Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/451547 (owner: Reedy)
[23:56:30] (PS2) Dzahn: netbox: make the role usable on a stand-alone host again [puppet] - https://gerrit.wikimedia.org/r/451546
[23:56:36] (Abandoned) Reedy: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/451547 (owner: Reedy)
[23:58:13] (CR) Dzahn: [C: 2] "i need this for testing for T185504" [puppet] - https://gerrit.wikimedia.org/r/451546 (owner: Dzahn)
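[editor's note] The submodule-of-a-submodule backport question above (VisualEditor/VisualEditor inside mediawiki/extensions/VisualEditor) comes down to: landing a fix in the inner repo does not move the outer repo, which keeps pinning the old commit until someone commits a submodule pointer bump. A minimal local sketch of that mechanic, using throwaway repos (`inner`/`outer` are stand-ins, not the real Wikimedia repos, and this says nothing about whether their CI automates the bump):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Inner repo: stand-in for VisualEditor/VisualEditor
git init -q inner
git -C inner -c user.email=t@example.org -c user.name=t commit -q --allow-empty -m 'initial'

# Outer repo: stand-in for mediawiki/extensions/VisualEditor, pinning the inner repo
git init -q outer
cd outer
git -c protocol.file.allow=always submodule add -q "$tmp/inner" lib
git -c user.email=t@example.org -c user.name=t commit -qm 'add lib submodule'

# A "backport" lands in the inner repo...
git -C "$tmp/inner" -c user.email=t@example.org -c user.name=t \
    commit -q --allow-empty -m 'backport fix'

# ...but the outer repo still records the old commit until we bump the pointer:
git -C lib fetch -q origin
git -C lib checkout -q FETCH_HEAD
git add lib
git -c user.email=t@example.org -c user.name=t commit -qm 'Update lib submodule'
git submodule status lib
```

After the final commit, the gitlink recorded in `outer` matches the inner repo's new HEAD; before it, `git submodule status` would have shown the stale commit with a `+` marker.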