[00:07:42] (CR) Dereckson: "Yes, it's the right format." [mediawiki-config] - https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: Jayprakash12345)
[00:09:48] (Abandoned) Dereckson: Redirect m.wikipedia.org to portal [puppet] - https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) (owner: Dereckson)
[00:17:35] Operations, Chinese-Sites, I18n: Depoly Noto fonts or their derivatives for Chinese (and J&K?) - https://phabricator.wikimedia.org/T180924#3773371 (Arthur2e5)
[00:36:13] Operations, Chinese-Sites, I18n: Deploy Noto fonts or their derivatives for Chinese (and J&K?) - https://phabricator.wikimedia.org/T180924#3773389 (Platonides)
[00:45:25] PROBLEM - MariaDB Slave IO: s7 on db2068 is CRITICAL: CRITICAL slave_io_state could not connect
[00:45:35] PROBLEM - Check systemd state on db2068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:45:54] PROBLEM - Disk space on db2068 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[00:46:04] PROBLEM - MariaDB Slave SQL: s7 on db2068 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:46:17] PROBLEM - MariaDB disk space on db2068 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[00:46:19] (PS3) Dereckson: Clarify header documentation for Apache redirects [puppet] - https://gerrit.wikimedia.org/r/285973
[00:46:27] PROBLEM - mysqld processes on db2068 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[00:51:37] (CR) Dereckson: "Scheduled for Puppet SWAT 2017-11-28." [puppet] - https://gerrit.wikimedia.org/r/285973 (owner: Dereckson)
[00:53:14] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:28:57] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 06m 26s)
[02:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:24] PROBLEM - Nginx local proxy to apache on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:42:14] RECOVERY - Nginx local proxy to apache on mw2251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.190 second response time
[03:25:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 853.23 seconds
[03:48:35] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:49:34] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[04:02:24] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 246.25 seconds
[04:24:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received
[04:26:04] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[04:35:56] (PS4) Jayprakash12345: Enable Single edit tab in Catalan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660)
[04:46:17] PROBLEM - MariaDB disk space on db2068 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[04:46:37] PROBLEM - mysqld processes on db2068 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[05:04:44] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[05:05:44] PROBLEM - Nginx local proxy to apache on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time
[05:06:45] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[05:06:45] RECOVERY - Nginx local proxy to apache on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time
[06:15:15] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[06:15:56] !log Reboot db2068 - T180927
[06:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:04] T180927: db2068 storage crash - https://phabricator.wikimedia.org/T180927
[06:20:26] (PS1) Marostegui: Revert "mariadb: Depool db1100, pool db1071 instead" [mediawiki-config] - https://gerrit.wikimedia.org/r/392370
[06:20:30] (PS2) Marostegui: Revert "mariadb: Depool db1100, pool db1071 instead" [mediawiki-config] - https://gerrit.wikimedia.org/r/392370
[06:20:44] RECOVERY - Check systemd state on db2068 is OK: OK - running: The system is fully operational
[06:21:04] RECOVERY - Disk space on db2068 is OK: DISK OK
[06:21:08] RECOVERY - MariaDB disk space on db2068 is OK: DISK OK
[06:22:38] (CR) Marostegui: [C: 2] Revert "mariadb: Depool db1100, pool db1071 instead" [mediawiki-config] - https://gerrit.wikimedia.org/r/392370 (owner: Marostegui)
[06:23:49] (Merged) jenkins-bot: Revert "mariadb: Depool db1100, pool db1071 instead" [mediawiki-config] - https://gerrit.wikimedia.org/r/392370 (owner: Marostegui)
[06:25:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weights for db1100 and db1071 - T180917 (duration: 00m 49s)
[06:25:15] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:20] T180917: db1100 replication broken - https://phabricator.wikimedia.org/T180917
[06:25:37] RECOVERY - mysqld processes on db2068 is OK: PROCS OK: 1 process with command name mysqld
[06:26:31] (CR) jenkins-bot: Revert "mariadb: Depool db1100, pool db1071 instead" [mediawiki-config] - https://gerrit.wikimedia.org/r/392370 (owner: Marostegui)
[06:27:24] RECOVERY - MariaDB Slave SQL: s7 on db2068 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:27:44] PROBLEM - ores on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8081: Connection refused
[06:27:54] RECOVERY - MariaDB Slave IO: s7 on db2068 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:29:13] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[06:32:14] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/392371 (https://phabricator.wikimedia.org/T177208)
[06:33:50] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/392371 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[06:34:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/392371 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[06:36:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T177208 (duration: 00m 48s)
[06:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:23] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208
[06:36:29] (CR) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/392371 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[06:37:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[06:40:54] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3580 bytes in 4.054 second response time
[06:41:02] !log Stop MySQL on db1082 to clone db1109 and db1110 - T180700
[06:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:10] T180700: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700
[06:44:13] ores suddenly went crazy
[06:44:20] is something wrong with scb1001?
[06:45:13] < icinga-wm> PROBLEM - ores on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8081: Connection refused
[06:45:19] That is from 6:27
[06:45:49] good morning!
[06:46:28] marostegui: grafana alarm doesn't show any sign of recovery
[06:46:28] https://grafana.wikimedia.org/dashboard/db/ores?panelId=23&fullscreen&orgId=1
[06:50:42] Amir1: I have never touched ores, but checking the processes on that host and comparing them with scb1002, it certainly looks stopped
[06:50:58] okay
[06:51:05] let me restart it
[06:51:11] hang on
[06:51:17] looks like puppet is doing something
[06:51:26] give it a se
[06:51:27] c
[06:51:39] looks running now
[06:51:45] can you check?
[06:51:59] sure
[06:52:03] marostegui: https://twitter.com/juokaz/status/931304593132224512
[06:52:21] XDDDDD
[06:52:44] The processes are now running at least
[06:53:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[06:53:38] And
[06:53:38] root@scb1001:~# netstat -putan | grep -i listen | grep 8081
[06:53:39] tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 1196/uwsgi
[06:53:48] both look fine
[06:53:50] let me curl
[06:57:39] curl also works fine
[06:57:47] I tested it in several types
[06:59:56] (PS1) Marostegui: s5.hosts: Add db1109 and db1110 to s5 [software] - https://gerrit.wikimedia.org/r/392372 (https://phabricator.wikimedia.org/T180700)
[07:00:12] Operations, DBA, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3773595 (Marostegui) When would you like to do this?
[07:01:13] Amir1: so we are good then?
[07:01:39] marostegui: everything looks fine except the grafana
[07:01:40] (CR) Marostegui: [C: 2] s5.hosts: Add db1109 and db1110 to s5 [software] - https://gerrit.wikimedia.org/r/392372 (https://phabricator.wikimedia.org/T180700) (owner: Marostegui)
[07:02:31] (Merged) jenkins-bot: s5.hosts: Add db1109 and db1110 to s5 [software] - https://gerrit.wikimedia.org/r/392372 (https://phabricator.wikimedia.org/T180700) (owner: Marostegui)
[07:06:32] Amir1: https://grafana.wikimedia.org/dashboard/db/ores?panelId=23&fullscreen&orgId=1&from=now-5m&to=now-1m this?
[07:06:48] yup
[07:23:03] (PS1) ArielGlenn: start full dump runs on first of the month again, delay second run [puppet] - https://gerrit.wikimedia.org/r/392373
[07:23:56] (CR) ArielGlenn: [C: 2] start full dump runs on first of the month again, delay second run [puppet] - https://gerrit.wikimedia.org/r/392373 (owner: ArielGlenn)
[07:24:54] (PS1) Marostegui: install_server: Reimage db1063 [puppet] - https://gerrit.wikimedia.org/r/392374 (https://phabricator.wikimedia.org/T180714)
[07:25:24] (PS2) Marostegui: install_server: Reimage db1063 [puppet] - https://gerrit.wikimedia.org/r/392374 (https://phabricator.wikimedia.org/T180714)
[07:25:53] (CR) Marostegui: [C: 2] install_server: Reimage db1063 [puppet] - https://gerrit.wikimedia.org/r/392374 (https://phabricator.wikimedia.org/T180714) (owner: Marostegui)
[07:27:37] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773605 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1063.eqiad.wmnet'] ```
[07:27:40] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[07:27:57] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773607 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1063.eqiad.wmnet'] ```
[07:28:13] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773608 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[07:30:50] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:32:40] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.023 second response time
[07:35:35] (PS1) Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/392377 (https://phabricator.wikimedia.org/T178359)
[07:37:23] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/392377 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:38:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/392377 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:38:42] (CR) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/392377 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:39:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T178359 (duration: 00m 49s)
[07:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:49:30] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[07:51:30] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[07:52:18] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) (owner: Rush)
[07:52:20] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 25.61 seconds
[07:52:37] Operations, DBA, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773617 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` and were **ALL** successful.
[07:52:53] (PS2) Muehlenhoff: base: purge apt sources on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/390377 (owner: Gehel)
[07:57:27] (CR) Hashar: [C: 1] "+1 to do it right now. I will be on IRC in roughly half an hour and check the puppet.log output on beta." [puppet] - https://gerrit.wikimedia.org/r/390377 (owner: Gehel)
[07:58:28] !log installing procmail security updates
[07:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:44] (PS1) Marostegui: Revert "install_server: Reimage db1063" [puppet] - https://gerrit.wikimedia.org/r/392378
[08:05:39] (CR) Marostegui: [C: 2] Revert "install_server: Reimage db1063" [puppet] - https://gerrit.wikimedia.org/r/392378 (owner: Marostegui)
[08:28:22] Operations, DBA, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3773638 (Marostegui) I don't think migrating to ROW is something we can actually do now after seeing the breakage caused when s5 master died (T180714) and there was a schema change on-going....
[08:30:28] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboot of dumps hosts - https://phabricator.wikimedia.org/T180127#3773643 (ArielGlenn) ms1001 and dumpsdata1002 done. dataset1001 probably can be done Friday and dumpsdata1001 on Wednesday with any luck.
[08:34:00] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3773646 (Tobi_WMDE_SW)
[08:46:03] !log Run mydumper for db1047.staging - T156844
[08:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:11] T156844: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844
[08:48:25] PROBLEM - Disk space on snapshot1007 is CRITICAL: DISK CRITICAL - free space: / 2702 MB (3% inode=92%)
[08:51:56] there seems to be a ton of logs --^
[08:52:09] (wikidatadump)
[08:53:13] cc: apergos
[08:57:01] (PS1) ArielGlenn: remove check for config file existence in cli utils [dumps] - https://gerrit.wikimedia.org/r/392387
[08:58:08] elukey: ok, looking
[08:58:42] apergos: there are ~10G files like /var/log/wikidatadump/dumpwikidatajson-wikidata-20171120-all-0.log etc..
[08:58:55] they seem a bit strange, since a ton of json is being dumped in there
[09:00:56] ugh
[09:01:38] I'm going to have to ping hoo about it, let me look and/or shoot as needed
[09:05:45] shot them. I'll open a ticket. we might have the same behavior with two other jobs later today, if they do the same thing I'll shoot them as well
[09:06:55] Operations, Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3773692 (Qgil)
[09:10:24] apergos: thanks!
[09:10:35] RECOVERY - Disk space on snapshot1007 is OK: DISK OK
[09:11:06] yw
[09:11:10] thanks for the ping
[09:16:30] (PS1) Elukey: Fix import of local libraries for python3 [software/druid_exporter] - https://gerrit.wikimedia.org/r/392388
[09:22:21] !log rebooting hafnium for update to 4.9.51
[09:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:46] (CR) ArielGlenn: [C: 2] remove check for config file existence in cli utils [dumps] - https://gerrit.wikimedia.org/r/392387 (owner: ArielGlenn)
[09:27:16] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3773727 (hashar)
[09:33:51] (CR) Filippo Giunchedi: [C: 1] Fix import of local libraries for python3 [software/druid_exporter] - https://gerrit.wikimedia.org/r/392388 (owner: Elukey)
[09:34:48] (PS3) Muehlenhoff: base: purge apt sources on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/390377 (owner: Gehel)
[09:35:59] (CR) Elukey: [V: 2 C: 2] Fix import of local libraries for python3 [software/druid_exporter] - https://gerrit.wikimedia.org/r/392388 (owner: Elukey)
[09:36:16] (CR) Muehlenhoff: [C: 2] base: purge apt sources on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/390377 (owner: Gehel)
[09:40:35] (PS1) DCausse: Revert "[cirrus] disable token count router" [mediawiki-config] - https://gerrit.wikimedia.org/r/392392 (https://phabricator.wikimedia.org/T180805)
[09:42:19] Operations, Cloud-Services, Cloud-VPS, DBA, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3773751 (jcrespo) This is still valid, labsdb1006 is still not setup and labsdb1007 is a single point of failure.
[09:42:28] Operations, Cloud-Services, Cloud-VPS, DBA, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3773752 (jcrespo) a:jcrespo>None
[09:42:37] (PS1) ArielGlenn: don't produce flow history dump during partial dump runs [puppet] - https://gerrit.wikimedia.org/r/392393
[09:42:50] (PS1) Elukey: Fix import of local libraries for python3 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/392394
[09:42:52] (PS1) Elukey: Release version 0.2 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/392395
[09:43:13] (CR) Elukey: [V: 2 C: 2] Fix import of local libraries for python3 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/392394 (owner: Elukey)
[09:48:28] (PS2) Elukey: Release version 0.2 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/392395
[09:49:31] (CR) ArielGlenn: [C: 2] don't produce flow history dump during partial dump runs [puppet] - https://gerrit.wikimedia.org/r/392393 (owner: ArielGlenn)
[09:50:54] (CR) Elukey: [V: 2 C: 2] "elukey@boron:~$ lintian prometheus-druid-exporter_0.2-1_amd64.changes" [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/392395 (owner: Elukey)
[09:59:58] (PS1) Filippo Giunchedi: cassandra: reprovision restbase2004 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/392396 (https://phabricator.wikimedia.org/T179422)
[10:02:42] the CI Jenkins is being rebooted
[10:05:21] !log rebooting contint1001 for update to 4.9.51
[10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:43] (PS1) ArielGlenn: fix missing redirect for wikidata json dumps [puppet] - https://gerrit.wikimedia.org/r/392398 (https://phabricator.wikimedia.org/T180934)
[10:08:28] !log contint1001: sudo systemctl start jenkins
[10:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:36] moritzm: yeah Jenkins has to be manually started on reboot :)
[10:08:48] ah ha
[10:09:05] hosts/contint1001.yaml:profile::ci::jenkins::service_ensure: unmanaged
[10:09:06] hosts/contint1001.yaml:profile::ci::jenkins::service_enable: false
[10:09:10] (CR) ArielGlenn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/392398 (https://phabricator.wikimedia.org/T180934) (owner: ArielGlenn)
[10:09:51] (CR) ArielGlenn: [C: 2] fix missing redirect for wikidata json dumps [puppet] - https://gerrit.wikimedia.org/r/392398 (https://phabricator.wikimedia.org/T180934) (owner: ArielGlenn)
[10:11:26] (PS1) Hashar: Enable jenkins on contint1001 reboot [puppet] - https://gerrit.wikimedia.org/r/392399
[10:11:48] moritzm: if you don't mind, let's have jenkins start up automatically when the machine boots https://gerrit.wikimedia.org/r/392399 Enable jenkins on contint1001 reboot :D
[10:11:52] (PS1) Jcrespo: mariadb: Leave reimaginable only the db latest servers [puppet] - https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662)
[10:12:09] hashar: definitely, having a look shortly
[10:12:12] moritzm: the service is unmanaged to prevent puppet from bringing it up when we manually stopped it
[10:12:28] but on a machine reboot, jenkins should start. Else one might forget to bring it back up
[10:17:03] hashar: yeah, but if we explicitly stop it for maintenance we can simply use "systemctl mask jenkins.service", that would even prevent a restart after a reboot (until it's lifted with "systemctl unmask jenkins.service")
[10:18:14] (PS2) Jcrespo: mariadb: Leave reimaginable only the db latest servers [puppet] - https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662)
[10:19:25] !log elastic/cirrus: reindexing english group0 and group1 wikis: T179945
[10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:32] T179945: Re-index English-language wikis to pick up kana mapping - https://phabricator.wikimedia.org/T179945
[10:20:20] (CR) Jcrespo: "Aside from the changes stated, I made some changes to some lab* hosts, please do a sanity check that I did not break anything. I tried to " [puppet] - https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[10:23:19] !log Manually re-started the Wikidata entity JSON dump on snapshot1007 (T180934)
[10:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:27] T180934: Wikidata json dumps filling /var/log - https://phabricator.wikimedia.org/T180934
[10:25:17] (CR) Mobrovac: [C: 1] cassandra: reprovision restbase2004 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/392396 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[10:35:24] Operations, ops-codfw, Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3773861 (fgiunchedi) restbase2006 is fully bootstrapped and we're ready to continue with restbase2004. I've upgraded the hp raid controller to firmware version `6.06(B)`. The other co...
[10:39:35] !log rebooting mwlog2001 for update to 4.9.51
[10:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:20] !log rebooting mwlog1001 for update to 4.9.51
[10:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:00] (CR) Filippo Giunchedi: [C: 2] cassandra: reprovision restbase2004 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/392396 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[10:51:07] (PS2) Filippo Giunchedi: cassandra: reprovision restbase2004 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/392396 (https://phabricator.wikimedia.org/T179422)
[10:51:41] !log rebooting cerium/praseodymium/xenon for update to 4.9.51
[10:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:46] (PS7) Volans: Metric alarms: add link to the Grafana dashboard [puppet] - https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353)
[10:57:48] !log reimage restbase2004 - T179422
[10:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:55] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[10:59:18] Operations, ops-codfw, DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773892 (jcrespo) Resolved>Open Maybe related: T102236 We need a BIOS upgrade and the HW logs.
[11:01:31] Operations, ops-codfw, DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773901 (Marostegui) a:Marostegui>Papaul @Papaul can you help us with the BIOS upgrade? @jcrespo there were no HW logs from the crash, there are only the typical ones AFTER the crash that doesn't say...
[11:03:31] (CR) Volans: [C: 2] Metric alarms: add link to the Grafana dashboard [puppet] - https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[11:06:45] (PS1) Marostegui: install_server: Install db1109, db1110 as jessie [puppet] - https://gerrit.wikimedia.org/r/392403 (https://phabricator.wikimedia.org/T180700)
[11:08:38] (CR) Marostegui: [C: 2] install_server: Install db1109, db1110 as jessie [puppet] - https://gerrit.wikimedia.org/r/392403 (https://phabricator.wikimedia.org/T180700) (owner: Marostegui)
[11:10:01] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1110.eqiad.wmnet ``` The log can be found in `/var/l...
[11:10:09] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773915 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1109.eqiad.wmnet ``` The log can be found in `/var/l...
[11:13:49] Operations, Ops-Access-Requests, Discovery, Wikimedia-Portals, and 2 others: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3773924 (phuedx) >>! In T180639#3764846, @RobH wrote: > Additionally, you should have your manager approve your expansions of rights...
[11:13:56] !log rebooting ruthenium for update to 4.9.51
[11:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:25] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:16:01] that's me, fixing (damn typo)
[11:16:44] Operations, ops-codfw, DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773925 (jcrespo) From the "health log": ``` 4 Critical Drive Array 11/20/2017 00:33 06/10/2015 16:05 2 Drive Array Controller Failure (Slot 0) ```
[11:18:14] (PS1) Volans: graphite: fix typo [puppet] - https://gerrit.wikimedia.org/r/392405 (https://phabricator.wikimedia.org/T170353)
[11:18:36] (CR) jerkins-bot: [V: -1] graphite: fix typo [puppet] - https://gerrit.wikimedia.org/r/392405 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[11:19:34] (PS2) Volans: graphite: fix typo [puppet] - https://gerrit.wikimedia.org/r/392405 (https://phabricator.wikimedia.org/T170353)
[11:20:25] (CR) Volans: [C: 2] graphite: fix typo [puppet] - https://gerrit.wikimedia.org/r/392405 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[11:21:11] (PS2) Filippo Giunchedi: prometheus: add mtail/exim jobs [puppet] - https://gerrit.wikimedia.org/r/392039 (https://phabricator.wikimedia.org/T179565)
[11:22:46] !log ppchelko@tin Started deploy [cpjobqueue/deploy@bdcef23]: Various performance improvements in committing
[11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:16] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@bdcef23]: Various performance improvements in committing (duration: 00m 30s)
[11:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:24] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:27:13] !log rebooting restbase-test cluster for update to 4.9.51
[11:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:10] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773962 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1110.eqiad.wmnet'] ``` and were **ALL** successful.
[11:32:39] (CR) Filippo Giunchedi: [C: 2] prometheus: add mtail/exim jobs [puppet] - https://gerrit.wikimedia.org/r/392039 (https://phabricator.wikimedia.org/T179565) (owner: Filippo Giunchedi)
[11:34:02] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773973 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1109.eqiad.wmnet'] ``` and were **ALL** successful.
[11:36:12] !log rebooting pybal-test for update to 4.9.51
[11:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:28] (PS1) Alexandros Kosiaris: servermon.rb: Parse puppet.conf [puppet] - https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254)
[11:39:52] (CR) jerkins-bot: [V: -1] servermon.rb: Parse puppet.conf [puppet] - https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) (owner: Alexandros Kosiaris)
[11:40:09] (PS1) Filippo Giunchedi: prometheus: fix relabeling for redis jobs [puppet] - https://gerrit.wikimedia.org/r/392407 (https://phabricator.wikimedia.org/T148637)
[11:41:01] !log rebooting etherpad1001 (etherpad.wikimedia.org) for update to 4.9.51
[11:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:29] (CR) Filippo Giunchedi: [C: 2] prometheus: fix relabeling for redis jobs [puppet] - https://gerrit.wikimedia.org/r/392407 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi)
[11:41:34] (PS2) Filippo Giunchedi: prometheus: fix relabeling for redis jobs [puppet] - https://gerrit.wikimedia.org/r/392407 (https://phabricator.wikimedia.org/T148637)
[11:43:53] (PS7) Volans: Icinga notification: use notes_url in messages [puppet] - https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353)
[11:43:55] (PS8) Volans: Metric alarms: make link to Grafana mandatory [puppet] - https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353)
[11:43:57] (PS1) Volans: icinga: convert display_name in notes_url [puppet] - https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353)
[11:43:59] (PS1) Volans: icinga: remove display_name [puppet] - https://gerrit.wikimedia.org/r/392409 (https://phabricator.wikimedia.org/T170353)
[11:46:16] !log rebooting tungsten for update to 4.9.51
[11:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:56] (PS1) Volans: netbox: rename duplicate Icinga check [puppet] - https://gerrit.wikimedia.org/r/392410
[11:50:07] (CR) Volans: "For reference see notes_url in https://www.icinga.com/docs/icinga1/latest/en/objectdefinitions.html#service" [puppet] - https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[12:00:02] (PS2) Alexandros Kosiaris: servermon.rb: Parse puppet.conf [puppet] - https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254)
[12:00:25] (CR) jerkins-bot: [V: -1] servermon.rb: Parse puppet.conf [puppet] - https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) (owner: Alexandros Kosiaris)
[12:05:12] Operations, DBA, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774043 (alanajjar) @Marostegui Are you here now?
[12:11:20] (03PS1) 10Elukey: Fix decoding of JSON data in python3 [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392412 [12:13:18] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: convert display_name in notes_url [puppet] - 10https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:13:22] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: remove display_name [puppet] - 10https://gerrit.wikimedia.org/r/392409 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:13:29] (03PS2) 10Elukey: Fix decoding of JSON data in python3 [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392412 [12:17:21] !log uploaded hhvm 3.18.5+dfsg-1+wmf1+icu57 to apt.wikimedia.org (jessie-wikimedia/component/icu57) (HHVM build linked against a co-installable backport of icu57) [12:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:07] (03CR) 10Volans: [C: 031] "LGTM" [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392412 (owner: 10Elukey) [12:19:24] (03PS2) 10Volans: icinga: convert display_name in notes_url [puppet] - 10https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) [12:22:40] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774102 (10Marostegui) yes, go ahead if you like Please paste the progress URL so we can check it too [12:24:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774106 (10alanajjar) Thanks @Marostegui [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/KnudW|The progress]] [12:28:01] (03PS3) 10Alexandros Kosiaris: servermon.rb: Parse puppet.conf [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) [12:28:29] (03CR) 10jerkins-bot: [V: 04-1] 
servermon.rb: Parse puppet.conf [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [12:32:19] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944#3774115 (10akosiaris) [12:33:05] (03PS3) 10Volans: icinga: convert display_name in notes_url [puppet] - 10https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) [12:33:07] (03PS2) 10Volans: icinga: remove display_name [puppet] - 10https://gerrit.wikimedia.org/r/392409 (https://phabricator.wikimedia.org/T170353) [12:33:10] (03PS8) 10Volans: Icinga notification: use notes_url in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [12:33:11] (03PS9) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [12:34:31] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774129 (10Marostegui) db1063 has been rebuilt and it is now catching up. Before putting it back as vslow, I am going to optimize wb_terms table as we have been doing... 
[12:34:40] (03PS4) 10Alexandros Kosiaris: servermon.rb: Parse puppet.conf [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) [12:35:13] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774130 (10Marostegui) [12:36:21] (03PS4) 10Volans: icinga: convert display_name in notes_url [puppet] - 10https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) [12:37:54] (03CR) 10Volans: [C: 032] icinga: convert display_name in notes_url [puppet] - 10https://gerrit.wikimedia.org/r/392408 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:40:11] !log Optimize db1063 wikidatawiki.wb_terms [12:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] (03CR) 10Alexandros Kosiaris: [C: 032] "Rubocop's happy, I did some live testing+debugging on puppetmaster2001, everything seemed fine. Deploying with fingers crossed." [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [12:43:22] (03PS5) 10Alexandros Kosiaris: servermon.rb: Parse puppet.conf [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) [12:43:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] servermon.rb: Parse puppet.conf [puppet] - 10https://gerrit.wikimedia.org/r/392406 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [12:53:15] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774176 (10Marostegui) The scope of this ticket is actually all done. So I would suggest we close it and do any amends or follow ups taken from the IR: https://wikitec... [12:57:10] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/furl] [12:59:44] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774192 (10Steinsplitter) [13:00:22] (cc: marostegui) [13:01:02] Steinsplitter: I am not the one in charge of renames :-) [13:01:36] I happen to have time now, but I don't want to be the blocker and/or the one giving green lights here ;-) [13:02:03] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774215 (10Steinsplitter) a:05Marostegui>03None [13:02:10] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [13:02:24] (03PS1) 10Alexandros Kosiaris: Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) [13:02:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [13:02:57] (03PS2) 10Alexandros Kosiaris: Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) [13:03:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [13:04:15] marostegui: oh okay :) who is doing that stuff now? lego seems busy. 
[13:04:35] (03CR) 10Elukey: [V: 032 C: 032] Fix decoding of JSON data in python3 [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392412 (owner: 10Elukey) [13:04:46] Steinsplitter: There is a discussion about that actually going on: https://phabricator.wikimedia.org/T180903 [13:05:47] oh, thanks. :) [13:09:14] dereckson: can you please add my task to T169440? i have no idea how to. thanks :) [13:09:15] T169440: Pending global renames in need of sysadmin supervision (tracking) - https://phabricator.wikimedia.org/T169440 [13:13:27] (03PS1) 10Elukey: Fix decoding of JSON data in python3 [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392417 [13:13:29] (03PS1) 10Elukey: Release version 0.3 [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392418 [13:14:07] (03CR) 10Elukey: [V: 032 C: 032] Fix decoding of JSON data in python3 [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392417 (owner: 10Elukey) [13:23:24] (03CR) 10Elukey: [V: 032 C: 032] Release version 0.3 [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392418 (owner: 10Elukey) [13:23:43] (03CR) 10Hashar: "Seems to me profile::ci::firewall should require profile::base::firewall. To make sure the later is always added?" [puppet] - 10https://gerrit.wikimedia.org/r/391742 (owner: 10Dzahn) [13:30:18] !log upload prometheus-druid-exporter 0.3 to stretch-wikimedia [13:30:23] godog: --^ \o/ [13:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:57] (03PS3) 10Elukey: role::druid::*: add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) [13:34:29] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774277 (10Marostegui) This looks done, feel free to resolve and start the other one if you like. 
[13:38:54] (03PS3) 10Volans: icinga: remove display_name [puppet] - 10https://gerrit.wikimedia.org/r/392409 (https://phabricator.wikimedia.org/T170353) [13:39:23] !log ppchelko@tin Started deploy [cpjobqueue/deploy@174420f]: Optimise committed offset calculation [13:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:50] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@174420f]: Optimise committed offset calculation (duration: 00m 28s) [13:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:15] (03CR) 10Volans: [C: 032] icinga: remove display_name [puppet] - 10https://gerrit.wikimedia.org/r/392409 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [13:48:42] (03CR) 10Marostegui: [C: 031] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [13:49:49] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774331 (10alanajjar) @Marostegui Thanks for help. I prefer to wait @Linedwell to preform T180903 because it's a request on GlobalRenameQueue and I don't n... 
[13:49:54] (03PS1) 10Arturo Borrero Gonzalez: apt: add class apt::dpkg-confold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) [13:50:02] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774335 (10alanajjar) 05Open>03Resolved [13:50:19] (03CR) 10jerkins-bot: [V: 04-1] apt: add class apt::dpkg-confold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [13:50:29] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774340 (10Marostegui) cool thanks [13:51:52] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774344 (10Steinsplitter) [13:53:49] !log disable puppet on scb100x and stop cpjobqueue to accumulate some backlog [13:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:07] (03PS3) 10Rush: labstore: fix rsync rule for misc [puppet] - 10https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) [13:55:17] (03CR) 10Elukey: [C: 032] role::druid::*: add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [13:55:23] (03PS4) 10Elukey: role::druid::*: add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) [13:57:06] (03PS4) 10Rush: labstore: fix rsync rule for misc [puppet] - 10https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) [13:57:31] 10Operations, 10Puppet, 10User-Joe, 
10cloud-services-team (FY2017-18): Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944#3774355 (10akosiaris) This is probably related. From icinga `HTTP CRITICAL - Invalid HTTP response received from host on por... [13:59:59] (03PS1) 10Elukey: role::druid::public::worker: add prometheus druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392422 (https://phabricator.wikimedia.org/T177459) [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171120T1400). [14:00:05] Zoranzoki21, Jayprakash12345, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:22] (03CR) 10Elukey: [C: 032] role::druid::public::worker: add prometheus druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392422 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [14:00:26] o/ [14:00:53] Zoranzoki21, Jayprakash12345: around for EU SWAT? [14:01:15] dcausse: want to deploy your own commits? 
[14:01:45] zeljkof: sure [14:02:01] dcausse: go ahead, since the rest of the people are not around :) [14:02:44] zeljkof: ok deploying [14:04:57] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392392 (https://phabricator.wikimedia.org/T180805) (owner: 10DCausse) [14:05:06] !log upload prometheus-druid-exporter 0.3 to jessie-wikimedia [14:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:35] (03PS2) 10Arturo Borrero Gonzalez: apt: add class apt::dpkg-confold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) [14:06:15] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-druid-exporter] [14:06:27] (03Merged) 10jenkins-bot: Revert "[cirrus] disable token count router" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392392 (https://phabricator.wikimedia.org/T180805) (owner: 10DCausse) [14:06:50] the puppet failures on druid are due to me, fixing them now [14:06:57] (03CR) 10jenkins-bot: Revert "[cirrus] disable token count router" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392392 (https://phabricator.wikimedia.org/T180805) (owner: 10DCausse) [14:08:41] (03PS3) 10Arturo Borrero Gonzalez: apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) [14:08:45] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:08:55] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:09:05] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:26] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:27] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944#3774376 (10herron) Alternatively updating the check to include `-u /puppet/v3` works. My thought was to update our Icinga ch... [14:09:32] checking scb [14:09:42] (03PS10) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [14:09:43] cpjobqueue.service fails [14:10:11] (03PS5) 10Rush: labstore: fix rsync rule for misc [puppet] - 10https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) [14:10:16] mmmm "Puppet is disabled. Stop cpjobqueue for backlog accumulation" [14:10:44] https://gerrit.wikimedia.org/r/#/c/392312/ [14:10:55] Who will merge [14:11:15] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:12:04] (03PS11) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [14:12:33] Who will SWAT today? 
[14:14:06] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: T180805: Revert [cirrus] disable token count router (duration: 00m 49s) [14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] T180805: Re-enable the token count router - https://phabricator.wikimedia.org/T180805 [14:14:45] elukey moritzm !log disable puppet on scb100x and stop cpjobqueue to accumulate some backlog [14:15:06] yep yep I followed up with Peter [14:15:09] thanks :) [14:15:33] zeljkof: I'm done [14:15:36] paladox: thanks, missed that [14:15:43] you're welcome :) [14:16:02] dcausse: great! [14:16:21] Zoranzoki21, Jayprakash12345: around for EU SWAT? [14:16:29] I can SWAT today [14:16:30] yeah [14:17:23] Jayprakash12345: ok, your commits are next [14:17:35] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [14:17:45] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [14:17:55] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [14:17:57] Jayprakash12345: please stand by, I will let you know when each commit is at mwdebug1002, ready for testing [14:18:05] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [14:18:22] zeljkof: ok [14:19:32] (03CR) 10Zfilipin: "Scheduled for EU SWAT by Zoranzoki21, but since he is not around, it will not be deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [14:20:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913) (owner: 10Jayprakash12345) [14:21:56] (03Merged) 10jenkins-bot: Enable wgNamespacesWithSubpages for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913) (owner: 10Jayprakash12345) [14:22:06] (03CR) 10jenkins-bot: Enable wgNamespacesWithSubpages for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913) (owner: 10Jayprakash12345) [14:22:12] !log enable semi-sync replication on s5 [14:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:52] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944#3774410 (10akosiaris) Yeah, after a brief discussion I 've had, it turns out the culprit is probably the puppet 3 rack compat... [14:23:12] Jayprakash12345: 392312 is at mwdebug1002, please test and let me know if I can deploy [14:23:23] (03PS5) 10Zfilipin: Enable Single edit tab in Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [14:23:30] ok [14:24:25] zeljkof: tested, please go ahead [14:24:34] Jayprakash12345: deploying... 
[14:25:52] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:392312|Enable wgNamespacesWithSubpages for hiwikiversity (T180913)]] (duration: 00m 48s) [14:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] T180913: Enable Sub-pages in the main namespace for hiwikiversity - https://phabricator.wikimedia.org/T180913 [14:26:03] Jayprakash12345: deployed, please check [14:28:28] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [14:28:39] zeljkof: now everything is fine, please go for the second patch [14:29:05] Jayprakash12345: ok, already reviewed it, waiting for Jenkins to merge it [14:29:40] (03Merged) 10jenkins-bot: Enable Single edit tab in Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [14:29:53] (03CR) 10jenkins-bot: Enable Single edit tab in Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [14:31:47] Jayprakash12345: 391756 is at mwdebug1002, please test and let me know if I can deploy [14:32:06] ok [14:33:47] zeljkof: tested, please go ahead [14:34:01] Jayprakash12345: deploying... 
[14:34:03] (03CR) 10Muehlenhoff: apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [14:34:54] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:391756|Enable Single edit tab in Catalan Wikipedia (T180660)]] (duration: 00m 49s) [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:01] T180660: Enable Single edit tab in Catalan Wikipedia - https://phabricator.wikimedia.org/T180660 [14:35:02] Jayprakash12345: deployed, please check [14:35:21] (03PS1) 10Herron: icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) [14:35:45] (03CR) 10jerkins-bot: [V: 04-1] icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) (owner: 10Herron) [14:36:59] Zoranzoki21: final call, around for EU SWAT? [14:37:21] zeljkof: Thanks [14:37:30] Jayprakash12345: all good? [14:37:53] (03PS2) 10Herron: icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) [14:37:55] yeah [14:38:15] Jayprakash12345: thanks for releasing with #releng! ;) [14:38:34] anything else to deploy during EU SWAT? 
[14:39:20] nothing new in the calendar [14:39:24] !log EU SWAT finished [14:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] (03PS13) 10Muehlenhoff: labstore: initial ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) [14:48:01] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8863/" [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:50:06] (03PS1) 10Elukey: Remove incomplete query/node/* metrics [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392424 (https://phabricator.wikimedia.org/T177459) [14:58:32] (03PS3) 10Herron: Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [14:58:51] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [14:59:24] (03CR) 10Herron: [V: 032 C: 032] Revert "Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001"" [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [15:00:22] (03PS4) 10Herron: puppet: point codfw mw systems at puppet 4 master puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [15:01:24] (03CR) 10Herron: [C: 032] puppet: point codfw mw systems at puppet 4 master puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [15:01:29] (03PS5) 10Herron: puppet: point codfw mw systems at puppet 4 master 
puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/392416 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [15:03:31] !log pointing codfw mw servers at codfw puppet 4 masters via puppetmaster2001 [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:47] 10Operations, 10MediaWiki-Containers: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696#3774599 (10hashar) [15:22:09] (03CR) 10Hashar: "recheck" [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/392037 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [15:24:37] (03PS2) 10Hashar: sample uwsgi app that would produce json status output for dumps [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/335007 (https://phabricator.wikimedia.org/T147177) (owner: 10ArielGlenn) [15:24:51] (03Abandoned) 10Hashar: Add tox and pass flake8 [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/392037 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [15:25:31] (03CR) 10Hashar: "Amended with the content of https://gerrit.wikimedia.org/r/#/c/392037/ "Add tox and pass flake8". 
And I have setup CI for this repository " [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/335007 (https://phabricator.wikimedia.org/T147177) (owner: 10ArielGlenn) [15:26:37] (03PS17) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [15:26:45] (03PS5) 10Paladox: Gerrit: Enable logstash for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/392083 (https://phabricator.wikimedia.org/T141324) [15:30:51] !log ppchelko@tin Started deploy [cpjobqueue/deploy@f0610d3]: Temporary set consumer_batch_size to 50 [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:22] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@f0610d3]: Temporary set consumer_batch_size to 50 (duration: 00m 30s) [15:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:28] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:34:18] !log disable puppet for a merge across cloud things [15:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:27] (03PS1) 10Muehlenhoff: role::labs::nfs::secondary: Add Ferm rules for DRBD [puppet] - 10https://gerrit.wikimedia.org/r/392430 (https://phabricator.wikimedia.org/T165136) [15:35:47] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:35] (03CR) 10Ottomata: [C: 031] "Cool! FYI kafka-jumbo is still not yet prod ready. We are waiting to set up TLS for broker communication, which is blocked on the cergen" [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) (owner: 10Elukey) [15:38:07] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:38:12] (03PS1) 10Jcrespo: mariadb: Depool db2068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392431 (https://phabricator.wikimedia.org/T180927) [15:39:21] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392431 (https://phabricator.wikimedia.org/T180927) (owner: 10Jcrespo) [15:39:58] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:40:17] (03CR) 10Rush: [C: 032] openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:40:27] (03PS12) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [15:40:34] (03Merged) 10jenkins-bot: mariadb: Depool db2068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392431 (https://phabricator.wikimedia.org/T180927) (owner: 10Jcrespo) [15:40:43] (03CR) 10jenkins-bot: mariadb: Depool db2068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392431 (https://phabricator.wikimedia.org/T180927) (owner: 10Jcrespo) [15:40:52] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392432 (https://phabricator.wikimedia.org/T128546) [15:41:57] ^ me [15:42:29] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2068 (duration: 00m 49s) [15:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:08] !log shutting down db2068 for maintenance after depool T180927 [15:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:15] T180927: db2068 storage crash - https://phabricator.wikimedia.org/T180927 [15:45:25] (03PS4) 10Ottomata: Temporarily 
disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) [15:45:30] (03CR) 10Ottomata: [V: 032 C: 032] Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:47:17] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3774682 (10RobH) Since it is a deploy and sudo,... [15:48:08] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:50:48] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:27] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) (owner: 10Herron) [15:52:01] (03PS1) 10Muehlenhoff: Add component/git for jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/392436 [15:54:57] (03PS1) 10Alexandros Kosiaris: postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 [15:54:58] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:59] (03PS1) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [15:55:23] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330#3774696 (10hashar) [15:55:54] (03PS8) 10Ottomata: [WIP] EventLogging analytics 
capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [15:56:02] (03PS3) 10Elukey: profile::mariadb::misc::eventlogging:replication: add EL sanitization cron [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) [15:56:48] (03CR) 10Elukey: "changed the flock command to avoid the -c parameter, it seems not needed." [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:57:49] (03PS9) 10Ottomata: EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [15:57:51] (03CR) 10Ottomata: "Looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:58:23] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330#3774712 (10hashar) [15:59:01] (03PS1) 10Rush: openstack: labs_tld is still used by instances [puppet] - 10https://gerrit.wikimedia.org/r/392440 (https://phabricator.wikimedia.org/T171494) [15:59:27] (03PS2) 10Rush: openstack: labs_tld is still used by instances [puppet] - 10https://gerrit.wikimedia.org/r/392440 (https://phabricator.wikimedia.org/T171494) [15:59:35] !log Use lz4 compression instead of deflate (T180804) [15:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:43] T180804: Reconfigure deflate compressed keyspaces to use LZ4 - https://phabricator.wikimedia.org/T180804 [16:00:07] (03CR) 10Rush: [C: 032] openstack: labs_tld is still used by instances [puppet] - 10https://gerrit.wikimedia.org/r/392440 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:01:01] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8868/" [puppet] - 
10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [16:01:14] I am? huh... I mean, it's fine, but I didn't actually get asked or sign up [16:01:23] * apergos goes to check general phab tickets somewhat late [16:02:38] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180211#3774724 (10Papaul) @Joe or @akosiaris since this systems is using software raid there is nothing showing in hardware log which disk is bad and the system diagnostic came up with no error. Can you please pull up at... [16:02:45] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3774726 (10ArielGlenn) p:05Triage>03Normal [16:02:56] chasemp: I can confirm puppet is fixed on instances ( re labs_tld) [16:03:20] tx hashar [16:06:29] !log ppchelko@tin Started deploy [cpjobqueue/deploy@f0610d3]: Revert: Temporary set consumer_batch_size to 50 [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:46] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@f0610d3]: Revert: Temporary set consumer_batch_size to 50 (duration: 00m 17s) [16:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] (03CR) 10Ottomata: [C: 031] Remove incomplete query/node/* metrics [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392424 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [16:21:10] (03PS2) 10Alexandros Kosiaris: postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 [16:21:12] (03PS2) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [16:21:14] (03PS1) 10Alexandros Kosiaris: Add 
postgresql::prometheus class to user of postgresql [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [16:21:53] (03CR) 10jerkins-bot: [V: 04-1] Add postgresql::prometheus class to user of postgresql [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [16:24:27] !log Add db1109 and db1110 to tendril - T180700 [16:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:34] T180700: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700 [16:26:28] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:06] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180211#3774845 (10akosiaris) No this is not a false alarm. One of the 2 disks has indeed failed and it seems so badly that the system can not even probe it anymore. What I could do is find out the serial number of the non... [16:28:38] (03PS1) 10Hashar: Address incompatbile-pointer-types [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/392442 [16:29:03] (03CR) 10Hashar: "Untested! 
:)" [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/392442 (owner: 10Hashar) [16:30:43] (03PS1) 10Hashar: gcc warning are now fatals (-Werror) [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/392443 [16:31:00] (03CR) 10Hashar: "The sole warniing is fixed by https://gerrit.wikimedia.org/r/#/c/392442/" [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/392443 (owner: 10Hashar) [16:32:04] !log ppchelko@tin Started deploy [cpjobqueue/deploy@f0610d3]: Revert: Temporary set consumer_batch_size to 50, forgot -f [16:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:32] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@f0610d3]: Revert: Temporary set consumer_batch_size to 50, forgot -f (duration: 00m 28s) [16:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:04] (03CR) 10Elukey: [V: 032 C: 032] Remove incomplete query/node/* metrics [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392424 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [16:34:01] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db1109 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392444 (https://phabricator.wikimedia.org/T180700) [16:34:59] (03PS3) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [16:35:01] (03PS2) 10Alexandros Kosiaris: Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [16:35:39] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db1109 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392444 (https://phabricator.wikimedia.org/T180700) (owner: 10Marostegui) [16:35:41] (03CR) 10jerkins-bot: [V: 04-1] Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 
(https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [16:36:08] !log ppchelko@tin Started deploy [cpjobqueue/deploy@174420f]: Revert: Temporary set consumer_batch_size to 50, forgot -f, checkout prev rev [16:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:50] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1109 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392444 (https://phabricator.wikimedia.org/T180700) (owner: 10Marostegui) [16:37:04] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1109 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392444 (https://phabricator.wikimedia.org/T180700) (owner: 10Marostegui) [16:38:14] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1109 and db1110 to the config depooled - T180700 (duration: 00m 48s) [16:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:22] T180700: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700 [16:39:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1109 and db1110 to the config depooled - T180700 (duration: 00m 48s) [16:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:41] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@174420f]: Revert: Temporary set consumer_batch_size to 50, forgot -f, checkout prev rev (duration: 03m 33s) [16:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:17] !log Reboot db1082 for kernel upgrade [16:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:55] (03CR) 10Andrew Bogott: puppet: Move cloud VMs to the puppet 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/392172 (https://phabricator.wikimedia.org/T178508) (owner: 10Andrew Bogott) [16:47:00] (03PS2) 10Muehlenhoff: Add component/git for jessie-wikimedia [puppet] - 
10https://gerrit.wikimedia.org/r/392436 [16:47:02] (03PS2) 10Andrew Bogott: puppet: Move cloud VMs to the puppet 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/392172 (https://phabricator.wikimedia.org/T178508) [16:49:02] (03CR) 10Andrew Bogott: [C: 032] puppet: Move cloud VMs to the puppet 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/392172 (https://phabricator.wikimedia.org/T178508) (owner: 10Andrew Bogott) [16:49:28] (03CR) 10Muehlenhoff: [C: 032] Add component/git for jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/392436 (owner: 10Muehlenhoff) [16:49:34] (03PS3) 10Muehlenhoff: Add component/git for jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/392436 [16:54:39] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning with a given provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning without specifying a provider) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /_info/home (redire [16:54:39] ) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [16:56:04] (03PS17) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [16:56:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3774972 (10Papaul) The ILO is up to date. 
I need to update the Storage and BIOS on the system but the Service pack disk that i have is old, there is a new Service pack ISO on the HP site t... [16:56:37] (03PS2) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [16:56:38] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [16:59:39] those are some verbose alerts :) [17:00:03] !log uploaded git 2.11-3+deb9u2+bpo8+wmf1 for component/git to apt.wikimedia.org/jessie-wikimedia [17:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:01] (03CR) 10Anomie: [C: 04-1] Remove overlapping userrights (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:09:35] (03PS3) 10Alexandros Kosiaris: postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 [17:09:37] (03PS4) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [17:09:39] (03PS3) 10Alexandros Kosiaris: Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [17:10:16] (03CR) 10jerkins-bot: [V: 04-1] Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [17:11:32] (03PS18) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [17:12:30] (03CR) 10TerraCodes: [C: 031] Remove overlapping userrights (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:12:46] (03CR) 10jerkins-bot: [V: 04-1] Remove 
overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:13:32] (03PS1) 10Muehlenhoff: deployment servers: Switch to component/git [puppet] - 10https://gerrit.wikimedia.org/r/392447 [17:13:45] (03CR) 10TerraCodes: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:14:52] (03CR) 10jerkins-bot: [V: 04-1] Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:16:44] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3775075 (10Andrew) [17:16:46] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3775076 (10Andrew) [17:16:48] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Switch Toolforge project hosts to the future parser - https://phabricator.wikimedia.org/T177298#3775072 (10Andrew) 05Open>03Resolved a:03Andrew Done with https://gerrit.wikimedia.org/r/#/c/392172/ [17:19:04] (03PS19) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [17:19:57] !log Sending Toolforge survey emails from silver for T177126 [17:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:04] T177126: 2017 Toolforge user survey - https://phabricator.wikimedia.org/T177126 [17:20:14] (03CR) 10jerkins-bot: [V: 04-1] Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:20:38] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8873/" [puppet] - 10https://gerrit.wikimedia.org/r/392168 
(https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:20:45] (03PS2) 10Rush: openstack: openstack2 => openstack [puppet] - 10https://gerrit.wikimedia.org/r/392168 (https://phabricator.wikimedia.org/T171494) [17:23:39] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:43] (03CR) 10Anomie: Remove overlapping userrights (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:24:18] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se [17:24:18] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received [17:24:28] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [17:24:38] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [17:24:39] (03CR) 10Anomie: Remove overlapping userrights (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [17:25:09] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:25:29] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.036 
second response time [17:25:34] 10Operations, 10Discovery, 10Wikimedia-Mailing-lists, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Email list needed for automating the Wikipedia.org portal - https://phabricator.wikimedia.org/T180976#3775108 (10debt) [17:25:42] 10Operations, 10Discovery, 10Wikimedia-Mailing-lists, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Email list needed for automating the Wikipedia.org portal - https://phabricator.wikimedia.org/T180976#3775121 (10debt) [17:25:45] (03PS20) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [17:32:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3775124 (10dduvall) Since we hadn't actually released 0.2.0 and there weren't any changes other than Debian package related ones, I moved the exist... [17:32:09] (03PS1) 10Lucas Werkmeister (WMDE): Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 [17:35:53] (03CR) 10Lucas Werkmeister (WMDE): "All of the mentioned changes are also in wmf.6 of the Wikidata build, which is the currently deployed version." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [17:36:10] (03PS1) 10Marostegui: db-eqiad.php: Pool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392450 (https://phabricator.wikimedia.org/T177208) [17:53:00] (03PS3) 10RobH: Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) (owner: 10Mholloway) [17:53:27] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, and 2 others: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3775168 (10RobH) This was approved in today's operations team meeting. I'll merge the access shortly. [17:53:31] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3775169 (10RobH) This was approved in today's op... [17:53:43] (03CR) 10RobH: [C: 032] Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) (owner: 10Mholloway) [17:54:12] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180211#3775171 (10Papaul) @akosiaris thank you. [17:55:16] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3775173 (10RobH) 05Open>03Resolved a:03RobH... 
[17:55:30] (03PS2) 10RobH: adding jdrewniak to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/391732 (https://phabricator.wikimedia.org/T180639) [17:55:45] (03CR) 10RobH: [C: 032] adding jdrewniak to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/391732 (https://phabricator.wikimedia.org/T180639) (owner: 10RobH) [17:56:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392450 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [17:56:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:54] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, and 2 others: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3775182 (10RobH) 05Open>03Resolved a:03RobH This access change is now live on the puppetmasters. It will take up to 30 minutes f... [17:57:29] (03Merged) 10jenkins-bot: db-eqiad.php: Pool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392450 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [17:57:38] (03CR) 10jenkins-bot: db-eqiad.php: Pool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392450 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [17:58:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with low weight - T177208 (duration: 00m 49s) [17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:38] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [17:59:16] 10Operations, 10Discovery, 10Wikimedia-Mailing-lists, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Email list needed for automating the Wikipedia.org portal - https://phabricator.wikimedia.org/T180976#3775108 (10RobH) #debt: Did you want to be the only list administrator (and use 
your @wikimedia.org a... [17:59:28] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [18:00:04] gehel: #bothumor I ♥ Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171120T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:03:08] (03PS2) 10Kaldari: Allow admins to remove users from MP3 uploaders user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392166 (https://phabricator.wikimedia.org/T180002) [18:06:21] 10Operations, 10Gerrit: Switch on http/2 in apache for gerrit - https://phabricator.wikimedia.org/T180978#3775204 (10Paladox) [18:08:29] (03CR) 10Ayounsi: [C: 032] netbox: rename duplicate Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/392410 (owner: 10Volans) [18:08:38] (03PS2) 10Ayounsi: netbox: rename duplicate Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/392410 (owner: 10Volans) [18:09:36] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3775220 (10phuedx) a:03phuedx [18:17:31] (03PS1) 10Elukey: Bump version to 0.4 [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392454 [18:17:49] (03CR) 10Elukey: [V: 032 C: 032] Bump version to 0.4 [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392454 (owner: 10Elukey) [18:19:20] (03CR) 10TerraCodes: [C: 031] "Scheduled for Nov 20 moring SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [18:19:40] (03CR) 10TerraCodes: [C: 031] "Scheduled for Nov 20 moring SWAT."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [18:23:26] !log smalyshev@tin Started deploy [wdqs/wdqs@2e39b69]: Blazegraph and GUI update [18:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:53] 10Operations, 10Cloud-VPS, 10User-bd808, 10cloud-services-team (Kanban): End self-service new Trusty instance creation in Cloud VPS; standardize on Debian base images - https://phabricator.wikimedia.org/T161899#3775256 (10Andrew) 05Open>03Resolved I made our standard Trusty image 'private' by default,... [18:24:59] (03PS1) 10ArielGlenn: gzip compress output from api jobs and abstracts dumps [dumps] - 10https://gerrit.wikimedia.org/r/392455 [18:25:09] !log smalyshev@tin Finished deploy [wdqs/wdqs@2e39b69]: Blazegraph and GUI update (duration: 01m 42s) [18:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [18:28:08] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:38:26] 10Operations, 10Discovery, 10Wikimedia-Mailing-lists, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Email list needed for automating the Wikipedia.org portal - https://phabricator.wikimedia.org/T180976#3775308 (10debt) Hi @RobH - please use this as a description: 'Public list for the automated statisti... 
[18:48:02] (03CR) 10Ottomata: [C: 032] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [18:48:06] (03PS10) 10Ottomata: EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [18:48:08] (03CR) 10Ottomata: [V: 032 C: 032] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [18:48:29] 10Operations, 10Discovery, 10Wikimedia-Mailing-lists, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Email list needed for automating the Wikipedia.org portal - https://phabricator.wikimedia.org/T180976#3775352 (10RobH) 05Open>03Resolved a:03RobH Done! I've gone ahead and created the list with yo... [18:48:42] (03CR) 10Herron: [C: 032] icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) (owner: 10Herron) [18:48:51] (03PS3) 10Herron: icinga: add support for puppet 4 in backend puppetmaster https checks [puppet] - 10https://gerrit.wikimedia.org/r/392423 (https://phabricator.wikimedia.org/T180944) [18:49:37] (03CR) 10ArielGlenn: [C: 032] gzip compress output from api jobs and abstracts dumps [dumps] - 10https://gerrit.wikimedia.org/r/392455 (owner: 10ArielGlenn) [18:51:33] !log disable puppet across cloud things to rollout https://gerrit.wikimedia.org/r/#/c/392168/ slowly [18:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:36] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:53:46] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary: Duplicate entry 355179201 for key tag_summary_rev_id, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1070-bin.001487, end_log_pos 487021887 [18:55:11] (03CR) 10Rush: [C: 032] openstack: openstack2 => openstack [puppet] - 10https://gerrit.wikimedia.org/r/392168 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:55:17] (03PS3) 10Rush: openstack: openstack2 => openstack [puppet] - 10https://gerrit.wikimedia.org/r/392168 (https://phabricator.wikimedia.org/T171494) [18:56:54] PROBLEM - Check systemd state on graphite1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:57:05] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 0.177 second response time [18:57:14] PROBLEM - Check systemd state on graphite2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:57:15] RECOVERY - puppetmaster backend https on puppetmaster2002 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.162 second response time [18:57:34] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.167 second response time [19:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171120T1900). 
[19:00:05] James_F, Amir1, jan_drewniak, kaldari, and Zackary: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:29] Heya. [19:00:29] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3775404 (10aborrero) I would need more information. These are my findings: * I can't find the `role::labs::puppetmaster` puppet role. Where do... [19:00:39] o/ [19:00:39] I will fix dbstore1001 [19:00:44] o/ [19:01:51] 10Operations, 10Cloud-VPS, 10User-bd808, 10cloud-services-team (Kanban): End self-service new Trusty instance creation in Cloud VPS; standardize on Debian base images - https://phabricator.wikimedia.org/T161899#3775408 (10bd808) a:05bd808>03Andrew [19:01:51] o/ [19:03:39] o/ [19:03:53] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:04:22] RECOVERY - Check systemd state on graphite2001 is OK: OK - running: The system is fully operational [19:07:22] PROBLEM - Check systemd state on graphite2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:10:23] Anyone deploying? [19:13:23] Seems not. :-( Amir1, can you? [19:13:50] yeah [19:13:54] I can do the SWAT [19:14:09] Thanks. 
[19:14:13] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [19:17:22] (03PS2) 10Ladsgroup: Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [19:17:33] (03CR) 10Ladsgroup: [C: 032] Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [19:17:47] rebase :/ [19:18:52] (03Merged) 10jenkins-bot: Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [19:19:02] (03CR) 10jenkins-bot: Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [19:19:39] James_F: your patch is in mwdebug1002 [19:21:18] Amir1: Yup, LGTM. [19:22:30] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392432 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:22:55] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Switch submit button from 'save' to 'publish' on dewiki (duration: 00m 50s) [19:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:12] (03PS1) 10Legoktm: keys: Document which key is which in keys.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392461 [19:23:14] (03PS1) 10Legoktm: keys: Note that Chris, Mark and Markus are no longer releasing new versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392462 (https://phabricator.wikimedia.org/T180615) [19:23:16] (03PS1) 10Legoktm: keys: Remove keys of former release managers from keys.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392463 [19:23:18] (03PS1) 10Legoktm: keys: Document usage of gpg --fetch-keys to import all keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392464 [19:23:19] 
Thanks Amir1. [19:23:50] Thanks for deploying with releng ;) [19:23:59] jan_drewniak: you're next [19:24:14] yippee [19:24:34] jan_drewniak: Should I pull it in mwdebug1002? [19:25:27] Amir1: sure ( although I don't know the specific steps to do that) [19:25:34] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392432 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:25:38] I can do it [19:25:40] don't worry [19:26:31] jan_drewniak: it should be live there [19:27:39] Amir1: yup, looks good! [19:29:09] ok [19:30:07] (03PS21) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [19:30:37] (03PS3) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [19:30:47] !log ladsgroup@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:392432|Bumping portals to master (T128546)]] (duration: 00m 49s) [19:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [19:31:38] !log ladsgroup@tin Synchronized portals: SWAT: [[gerrit:392432|Bumping portals to master (T128546)]] (duration: 00m 51s) [19:31:42] deployed [19:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:57] (03PS3) 10Ladsgroup: Allow admins to remove users from MP3 uploaders user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392166 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [19:32:11] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392166 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [19:32:27] Amir1: thanks! [19:32:32] kaldari: yours is next, testable? 
[19:32:45] jan_drewniak: Thank you, Keep up the great work [19:32:50] Amir1: Yes [19:32:55] cool [19:33:23] (03Merged) 10jenkins-bot: Allow admins to remove users from MP3 uploaders user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392166 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [19:34:13] kaldari: it's live in mwdebug1002 [19:34:37] checking... [19:35:14] Amir1: Looks good, feel free to sync [19:35:25] thanks [19:36:15] Krinkle: o/ - coal.service on graphite1001 seems failing for a KeyError exception https://phabricator.wikimedia.org/P6355 [19:36:28] cc ottomata (it mentions eventlogging) [19:36:50] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Allow admins to remove users from MP3 uploaders user group (T180002) (duration: 00m 49s) [19:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:56] T180002: Create new "MP3 uploaders" user group on Commons - https://phabricator.wikimedia.org/T180002 [19:37:03] hmmm [19:37:05] ah yes the capsule refactoring! [19:37:08] it might be that [19:37:24] OH, ya. [19:37:25] oops [19:37:26] fixing [19:37:29] coal.. [19:37:29] hm [19:37:37] first time that I hear about it [19:37:38] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [19:37:40] :D [19:37:46] yeah, i asked Krinkle if webperf stuff used timestamp, i guess he forgot coal [19:37:56] kaldari: your change is live [19:38:00] we can always blame ori :D [19:38:19] ottomata: Oops :) [19:38:20] Amir1: Is it too late for me to add a commit? https://gerrit.wikimedia.org/r/#/c/392465/ [19:38:43] ottomata: Hm.. 
should be avoidable, it uses it to compute a 5-min moving median [19:38:50] Not sure why it would need the capsule timestamp [19:38:51] (03Merged) 10jenkins-bot: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [19:38:59] ottomata: What is happening right now? Are we rolled back? Or is it still down? [19:39:06] RoanKattouw: Nah, let's do it [19:39:15] Awesome thanks [19:39:41] Krinkle: it's down, I'm editing code to fix. [19:39:52] (03PS1) 10Chad: wikidatawiki to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392466 [19:39:54] (03PS1) 10Chad: group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 [19:40:37] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:40:43] !log ladsgroup@tin Synchronized wmf-config/throttle.php: Adjust throttle.php for dewiki workshop (T180046) (duration: 00m 48s) [19:40:48] Amir1: Looks good [19:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:50] T180046: Account creation throttle exemption for workshop on 2017-12-11 - https://phabricator.wikimedia.org/T180046 [19:40:50] Thanks! [19:41:01] kaldari: Thank you! 
[19:41:11] Zackary: Your first patch is live [19:41:49] nope, forgot to rebase [19:41:55] deploying again [19:42:01] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392432 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:42:03] (03CR) 10jenkins-bot: Allow admins to remove users from MP3 uploaders user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392166 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [19:42:05] (03CR) 10jenkins-bot: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [19:42:31] !log ladsgroup@tin Synchronized wmf-config/throttle.php: Adjust throttle.php for dewiki workshop (T180046) (duration: 00m 49s) [19:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] (03PS22) 10Ladsgroup: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:43:31] 10Operations, 10Ops-Access-Requests, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3720398 (10ArielGlenn) @hoo: Which hosts are we looking at then, varnish servers and the app servers? Also, what do you need to be able to strace? I assume these will... 
[19:43:34] (03CR) 10Ladsgroup: [C: 032] Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:43:40] (03PS1) 10Ottomata: Fix coal to use EventCapsule dt instead of timestamp [puppet] - 10https://gerrit.wikimedia.org/r/392468 (https://phabricator.wikimedia.org/T179625) [19:43:43] Krinkle: ^ [19:44:36] (03PS2) 10Ottomata: Fix coal to use EventCapsule dt instead of timestamp [puppet] - 10https://gerrit.wikimedia.org/r/392468 (https://phabricator.wikimedia.org/T179625) [19:44:38] actually ^ [19:44:52] RoanKattouw: Group2 is on wmf.7, do you want that backport too? [19:44:53] (03Merged) 10jenkins-bot: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:44:59] Nah [19:45:17] kk [19:45:17] group2 is moving to wmf.8 very soon, right? [19:45:22] Krinkle: am applying that manually on graphite1001, stopping puppet and restarting coal [19:45:23] Later today, i thoughtt [19:45:24] IDK :D [19:46:11] Amir1: thanks [19:46:22] Zackary: your second patch is live on mwdebug1002, test please [19:46:25] (03PS3) 10Ottomata: Fix coal to use EventCapsule dt instead of timestamp [puppet] - 10https://gerrit.wikimedia.org/r/392468 (https://phabricator.wikimedia.org/T179625) [19:46:31] and let me know when I can move on [19:47:05] !log restarted coal with fixes for eventcapsule changes in T179625 [19:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:12] T179625: Resolve EventCapsule / MySQL / Hive schema discrepancies - https://phabricator.wikimedia.org/T179625 [19:47:12] RECOVERY - Check systemd state on graphite1001 is OK: OK - running: The system is fully operational [19:47:43] looks good so far, merging. 
[19:48:09] (03CR) 10Ottomata: [C: 032] Fix coal to use EventCapsule dt instead of timestamp [puppet] - 10https://gerrit.wikimedia.org/r/392468 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [19:48:30] idk how I'm supposed to test it since I don't have bureaucrat or an NDA, but the wiki looks like it still running? [19:48:35] (03PS5) 10Rush: phab: remove obsolete portions of email handler [puppet] - 10https://gerrit.wikimedia.org/r/391969 [19:48:41] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:48:46] Zackary: I'm 'crat in fawiki [19:48:51] let me test it there [19:49:25] (03CR) 10jenkins-bot: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:49:42] works fine [19:49:45] moving forward [19:49:55] aw man, coal STILL works on the zeromq evenltogging endpoint!? [19:50:46] ah, thanks [19:51:01] (03PS4) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [19:51:08] ottomata: https://phabricator.wikimedia.org/T110903#3582398 [19:51:16] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Remove overlapping userrights (T101983) (duration: 00m 49s) [19:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:22] ah ok, it might be decomed [19:51:22] T101983: Allow admins adding confirmed user group on all WMF wikis - https://phabricator.wikimedia.org/T101983 [19:51:23] cool. 
[19:51:49] (03CR) 10Rush: [C: 032] phab: remove obsolete portions of email handler [puppet] - 10https://gerrit.wikimedia.org/r/391969 (owner: 10Rush) [19:52:14] (03CR) 10Krinkle: "Don't forget PrivateSettings.php.example (and to the merger: The real file in production, before syncing)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [19:52:30] !log updating phab mail handler [19:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:45] thanks elukey :) [19:55:52] (03Abandoned) 10GeoffreyT2000: Set 'watchcreations' preference to true by default on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750) (owner: 10GeoffreyT2000) [19:58:50] !log re-routing ns1.wikimedia.org traffic to radon for baham reboot [19:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:26] (03PS5) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [20:00:43] RoanKattouw: your patch is live in mwdebug1002 [20:01:34] Thanks, checking [20:01:57] (03CR) 10jerkins-bot: [V: 04-1] $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [20:03:07] (03PS1) 10Ottomata: Remove eventlogging_refine_test temporary class [puppet] - 10https://gerrit.wikimedia.org/r/392473 (https://phabricator.wikimedia.org/T179625) [20:04:11] (03CR) 10Ottomata: [C: 032] Remove eventlogging_refine_test temporary class [puppet] - 10https://gerrit.wikimedia.org/r/392473 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:04:16] (03PS2) 10Ottomata: Remove eventlogging_refine_test temporary class [puppet] - 10https://gerrit.wikimedia.org/r/392473 (https://phabricator.wikimedia.org/T179625) [20:04:18] (03CR) 10Ottomata: [V: 032 C: 032] Remove 
eventlogging_refine_test temporary class [puppet] - 10https://gerrit.wikimedia.org/r/392473 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:04:42] RECOVERY - Check systemd state on graphite2001 is OK: OK - running: The system is fully operational [20:05:33] (03PS4) 10Ladsgroup: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392200 (https://phabricator.wikimedia.org/T180879) [20:05:45] (03PS6) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [20:05:57] (03CR) 10Brian Wolff: [C: 031] keys: Document usage of gpg --fetch-keys to import all keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392464 (owner: 10Legoktm) [20:06:00] !log rebooting baham [20:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:10] Amir1: It has a bug but I think we're better off with than without, so let's proceed with deploying that patch [20:06:16] mooeypoo has a fix coming [20:06:32] kk [20:06:37] Moving forward :) [20:06:51] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392200 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [20:07:14] !log ladsgroup@tin Synchronized php-1.31.0-wmf.8/resources/src/mediawiki.rcfilters/ui/mw.rcfilters.ui.ItemMenuOptionWidget.js: RCFilters: Only apply excluded label to namespace items (T180863) (duration: 00m 49s) [20:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:21] T180863: [wmf.7] "Excluded" label is displayed with filter selection - https://phabricator.wikimedia.org/T180863 [20:08:10] (03Merged) 10jenkins-bot: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392200 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [20:08:20] (03CR) 10jenkins-bot: Enable Translate extension in amwikimedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/392200 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [20:08:21] PROBLEM - HHVM rendering on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:41] PROBLEM - Host ns1-v6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:ed1a::e) [20:09:11] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [20:09:11] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 79463 bytes in 0.306 second response time [20:09:54] Amir1: https://gerrit.wikimedia.org/r/#/c/392477/1 is the follow-up, turns out we had a typo [20:11:00] !log CI docker jobs were all broken due to a mistake. Should be back now. T177684 [20:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:07] T177684: Should we expose some JENKINS_ environment variables in docker? - https://phabricator.wikimedia.org/T177684 [20:11:09] RoanKattouw: haha, do you want to cherry-pick it? 
[20:11:20] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Enable Translate extension in amwikimedia (T180879) (duration: 00m 49s) [20:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:27] T180879: Install translate extension in amwikimedia - https://phabricator.wikimedia.org/T180879 [20:12:03] (03PS1) 10Ladsgroup: Revert "Enable Translate extension in amwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392481 [20:12:07] (03CR) 10Ladsgroup: [C: 032] Revert "Enable Translate extension in amwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392481 (owner: 10Ladsgroup) [20:12:38] (03PS7) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [20:12:45] Needs table creation [20:13:28] (03PS1) 10Rush: phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 [20:13:42] (03CR) 10jerkins-bot: [V: 04-1] phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:13:45] (03PS2) 10Rush: phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 [20:14:06] (03CR) 10jerkins-bot: [V: 04-1] phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:14:29] (03PS3) 10Rush: phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 [20:14:46] (03CR) 10Paladox: phab: email_pipe needs path to handler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:14:53] (03CR) 10jerkins-bot: [V: 04-1] phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:15:35] (03Merged) 10jenkins-bot: Revert "Enable Translate extension in amwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392481 (owner: 10Ladsgroup) [20:16:22] (03PS4) 10Rush: phab: 
email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 [20:16:42] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Enable Translate extension in amwikimedia (T180879)" (duration: 00m 48s) [20:16:43] RoanKattouw: It's waay out of SWAT time and I had to revert one patch too (mine!) I think that would be great for evening SWAT [20:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:49] T180879: Install translate extension in amwikimedia - https://phabricator.wikimedia.org/T180879 [20:16:51] !log ns1 dns traffic back to normal on baham [20:16:55] (03CR) 10Paladox: phab: email_pipe needs path to handler (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:00] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Remove overlapping userrights (T101983) (duration: 19m 41s) [20:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:06] T101983: Allow admins adding confirmed user group on all WMF wikis - https://phabricator.wikimedia.org/T101983 [20:17:14] (03CR) 10jenkins-bot: Revert "Enable Translate extension in amwikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392481 (owner: 10Ladsgroup) [20:17:53] (03CR) 10Paladox: phab: email_pipe needs path to handler (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:18:29] I I'm out [20:18:36] (03PS5) 10Rush: phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 [20:18:58] (03CR) 10jerkins-bot: [V: 04-1] phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:20:17] note to self, try to hurry and it'll take twice as long [20:20:17] (03PS6) 10Rush: phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 
[20:20:54] (03CR) 10Rush: [C: 032] phab: email_pipe needs path to handler [puppet] - 10https://gerrit.wikimedia.org/r/392483 (owner: 10Rush) [20:23:47] (03PS1) 10Rush: phab: add phab_bot section back to config [puppet] - 10https://gerrit.wikimedia.org/r/392485 [20:24:15] (03CR) 10Rush: [C: 032] phab: add phab_bot section back to config [puppet] - 10https://gerrit.wikimedia.org/r/392485 (owner: 10Rush) [20:28:03] (03PS8) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [20:35:10] (03PS2) 10Rush: toolforge: Replace /usr/local/bin/crontab with oge-crontab [puppet] - 10https://gerrit.wikimedia.org/r/391979 (https://phabricator.wikimedia.org/T156174) (owner: 10BryanDavis) [20:36:08] (03Draft1) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 [20:36:10] (03PS2) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [20:36:32] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [20:36:34] (03CR) 10Paladox: "Untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [20:38:11] (03CR) 10Rush: [C: 032] toolforge: Replace /usr/local/bin/crontab with oge-crontab [puppet] - 10https://gerrit.wikimedia.org/r/391979 (https://phabricator.wikimedia.org/T156174) (owner: 10BryanDavis) [20:39:45] (03PS3) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [20:40:13] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [20:40:58] hmm not sure why it failed with include and fails with class{}. [20:41:06] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Switch on http/2 in apache for gerrit - https://phabricator.wikimedia.org/T180978#3775800 (10demon) [20:42:44] 10Operations, 10Phabricator, 10Traffic: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3775802 (10Paladox) [20:52:26] (03PS4) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [20:52:47] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [20:55:58] (03PS5) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [20:56:18] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [20:58:21] (03CR) 10Chad: [C: 032] group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 (owner: 10Chad) [20:58:48] Amir1: OK, thanks, I'll put it in evening SWAT [21:00:05] gwicke, cscott, arlolra, subbu, 
bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171120T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:21] PROBLEM - Nginx local proxy to apache on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:00:29] Something fun in store for ORES. Taxiing on the runway now. [21:01:12] RECOVERY - Nginx local proxy to apache on mw2121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.198 second response time [21:04:02] apergos: bon soir! Any idea why I can’t ssh to tin? DM’ing details... [21:04:46] hello! [21:05:36] no parsoid deploy today [21:06:16] (03PS6) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [21:07:40] !log re-routing ns0.wikimedia.org traffic to baham for radon reboot [21:07:41] (03Draft1) 10Paladox: Gerrit: Move apache resources to the profile instead of gerrit::proxy [puppet] - 10https://gerrit.wikimedia.org/r/392494 [21:07:45] (03PS2) 10Paladox: Gerrit: Move apache resources to the profile instead of gerrit::proxy [puppet] - 10https://gerrit.wikimedia.org/r/392494 [21:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:20] (03Draft1) 10Paladox: apache: Add http2 to mod [puppet] - 10https://gerrit.wikimedia.org/r/392495 [21:09:23] (03PS2) 10Paladox: apache: Add http2 to mod [puppet] - 10https://gerrit.wikimedia.org/r/392495 [21:09:54] (03PS7) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [21:10:02] (03PS8) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [21:10:04] ssorry for spam. 
[21:10:48] (03PS9) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) [21:13:49] !log rebooting radon [21:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:05] !log routing ns0.wikimedia.org back to radon post-reboot [21:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:32] PROBLEM - Apache HTTP on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:59] Excuse my ignorance and lack of docs but what exactly is radon? [21:27:19] radon is the name of a server [21:27:21] RECOVERY - Apache HTTP on mw2124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time [21:27:47] and also a chemical element: https://en.wikipedia.org/wiki/Radon [21:28:02] (that's our naming scheme for servers in the Virginia datacenter) [21:28:06] That seems to be a common theme with server naming bblack xD [21:28:43] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Miscellaneous_servers [21:29:46] (also, there are no misc servers in ulsfo, I think I put "Famous Druids" in there as a joke a long time ago and it just got left there) [21:33:15] !log routing ns2.wikimedia.org to radon for eeden reboot [21:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:48] !log awight@tin Started deploy [ores/deploy@5084251]: Updating ORES to revscoring 2.0.10, T179711 [21:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:55] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [21:38:21] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:29] (03PS1) 10Andrew Bogott: Repool labvirt1015, again [puppet] - 10https://gerrit.wikimedia.org/r/392514 [21:43:32] (03CR) 10Rush: [C: 031] Repool labvirt1015, again [puppet] - 
10https://gerrit.wikimedia.org/r/392514 (owner: 10Andrew Bogott) [21:44:05] @seen Coren [21:44:06] yannf: Last time I saw Coren they were quitting the network with reason: Remote host closed the connection N/A at 7/23/2017 2:40:18 AM (120d19h3m48s ago) [21:44:07] !log rebooting eeden [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:17] (03CR) 10Andrew Bogott: [C: 032] Repool labvirt1015, again [puppet] - 10https://gerrit.wikimedia.org/r/392514 (owner: 10Andrew Bogott) [21:44:18] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Switch on http/2 in apache for gerrit - https://phabricator.wikimedia.org/T180978#3775978 (10demon) Chatted with @bblack on IRC, couple of quick notes: * Good idea in general * Apache HTTP2 module hasn't been reviewed or used at WMF yet, so we should d... [21:44:45] will that reboot provide a brand new paradise? [21:46:38] Yes called a server working [21:47:02] I think that one was named for https://en.wikipedia.org/wiki/Marcel_van_Eeden [21:47:51] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.78 ms [21:50:57] or some such famous Dutch person [21:51:01] maybe it was https://en.wikipedia.org/wiki/Frederik_van_Eeden [21:52:05] (03PS2) 10Chad: group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 [21:52:08] !log aaron@tin Synchronized php-1.31.0-wmf.8/extensions/Collection: 3baebf4a: cache key name fix (duration: 00m 51s) [21:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:19] !log ns2 back on eeden [21:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:27] 10Operations, 10ops-esams, 10Traffic: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3776032 (10RobH) Added new group to self dispatch as Dell support advised and it did not allow me to send the part. I've emailed into support asking for next steps. 
[21:54:23] (03CR) 10Chad: [C: 032] group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 (owner: 10Chad) [21:57:41] (03Merged) 10jenkins-bot: group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 (owner: 10Chad) [21:58:58] no_justification: huh, no train is listed at https://wikitech.wikimedia.org/wiki/Deployments#Ongoing this week (though is for next week) [21:59:22] (03CR) 10jenkins-bot: group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392467 (owner: 10Chad) [22:00:05] dapatrick, bawolff, and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171120T2200). Please do the needful. [22:00:06] No GERRIT patches in the queue for this window AFAICS. [22:03:03] AaronSchulz: There is no train this week :) [22:03:09] It's last week's train :p [22:03:45] hmm, maybe irccloud messed up? [22:03:48] the train is a bit behind schedule. Thankfully this isn't Tokyo subways where Chad would need to quit due to being so late. [22:04:02] It wasn't the conductor's fault ;-) [22:04:10] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.8 [22:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:44] no_justification: :) :) [22:06:32] * no_justification eyes Aaron for stealing his lock [22:08:15] fwiw, I’m rolling over my deployment window. These repos are slooow to clone. [22:08:57] This is only a deployment to the ORES servers, so worst-case impact will be some 500s and weird behavior of the new RCFilters. [22:09:23] * no_justification runs off to get a delayed lunch since everyone's going to deploy it seems [22:09:45] lolol [22:10:08] Lunch is funny? [22:10:15] * no_justification grumbles [22:10:41] When lunch is a form of procrastination and hiding from monsters, IMO yes [22:10:53] haha pre-lunch jokes are unfunny though. 
[22:11:35] Oh, lunch isn't a form of procrastination. It's meant to fill time I would rather be working :\ [22:11:36] huh, running sync-file on a dir either hangs or actually tries to do it all [22:11:39] * no_justification wanders off [22:11:42] could use some dummy proofing [22:11:50] !log aaron@tin Synchronized php-1.31.0-wmf.8/includes/libs/objectcache/WANObjectCache.php: 7e74b49: namespace WAN cache variant keys (duration: 00m 48s) [22:11:56] On a directory? Yes, it'll lint it all [22:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:36] I'm not sure what "do it all" means [22:12:41] It's the same codepath for files & dirs. [22:13:29] AaronSchulz: fwiw, I’ve recently sync-file’d a dir, and it was a smooth experience. Only synced the files I intended to. [22:13:38] This may have been 2 weeks ago. [22:14:32] New scap code went out in that time period but sync-file hasn't changed lately [22:16:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [22:20:22] no_justification: I guess it might be useful for sync-file to reject dirs and for sync-dir to reject files. [22:20:28] Assuming that is what AaronSchulz is referring to. [22:20:37] They're the same code [22:20:42] E.g. when you intend to sync a file but stopped early with auto completion [22:20:55] Or put a space somewhere, etc. [22:21:11] Sync dir is a back compatible alias [22:21:17] Right [22:21:42] That's how it is now :) [22:21:46] It's been like that for ages now [22:22:43] I guess a dir is a file :D [22:25:16] Isn't technically everything at one point a file on a computer? 
[22:26:16] !log aaron@tin Synchronized php-1.31.0-wmf.7/extensions/Collection: cache key name fix (duration: 00m 49s) [22:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:29] no_justification: fyi, I’m done deploying to production, and my scap is just rolling across two non-production clusters. [22:27:41] !log awight@tin Finished deploy [ores/deploy@5084251]: Updating ORES to revscoring 2.0.10, T179711 (duration: 49m 54s) [22:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:47] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [22:27:58] in fact. it failed on the non-production machines \o/ [22:28:04] * no_justification adds a lunch plugin to scap [22:28:29] I'm sure there's an API for postmates! [22:28:33] scap snack [22:28:34] crunchy! [22:28:40] talk about dogfooding [22:29:12] !log aaron@tin Synchronized php-1.31.0-wmf.7/includes/libs/objectcache/WANObjectCache.php: fix cache key namespace (duration: 00m 49s) [22:29:14] Scap snack® by Wikimedia Tech Community [22:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:28] Krinkle: ok, the key fix should be on all wikis. I guess stats cleanup can commence. [22:30:00] AaronSchulz: OK. Wanna try it yourself this time? I didn't realise last time that you have root :) [22:30:11] Just make sure it's applied to both 1001 and 2001 [22:33:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [22:42:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [22:47:33] Krinkle: I can't seem to login [22:48:48] awight: are you reverting the ORES deploy? 
[22:49:06] legoktm: We’re discussing in #wikimedia-ai [22:49:07] https://phabricator.wikimedia.org/T181006 needs to be fixed ASAP [22:49:08] !log Sharp rise in HTTP 500 errors as of 22:05 (45 minutes ago) [22:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:15] Krinkle: it's https://phabricator.wikimedia.org/T181006 [22:49:30] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json [22:50:22] MatmaRex: thx [22:50:32] !log MW HTTP 500 spike tracked as https://phabricator.wikimedia.org/T181006 [22:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:29] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:52:33] ^ shows here too [22:52:43] Krinkle: actually, i'm not sure if the timestamps for this match? that bug is caused by a deployment at 22:27. although it has "duration: 49m 54s". [22:53:39] Indeed. [22:53:41] Its caused by wikiversions [22:53:49] no_justification: ^ [22:53:54] Can we roll back to make sure? [22:54:06] (Or first identify the cause / which requests are fatalling) [22:54:11] Starting an emergency rollback for ORES [22:54:12] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=1511209866383&to=1511218400121 [22:54:15] Son of a mother fucking bitch [22:54:15] logstash will probably tell [22:54:20] !log awight@tin Started deploy [ores/deploy@5084251]: Rollback ORES; T179711 [22:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:27] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [22:54:33] Lets see what happens with ores rollback? 
[22:55:11] !log rolling back ORES to fix T181006 [22:55:15] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: no wmf.8 for group2. i hate my life [22:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:16] T181006: Watchlist and RecentChanges don't work on ruwiki - https://phabricator.wikimedia.org/T181006 [22:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:26] !log awight@tin Finished deploy [ores/deploy@5084251]: Rollback ORES; T179711 (duration: 01m 05s) [22:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:02] (PS1) Chad: Revert "group2 to wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/392531 [22:56:06] (CR) Chad: [V: 2 C: 2] Revert "group2 to wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/392531 (owner: Chad) [22:57:01] no_justification: Sorry :/ [22:57:22] It's ok. Just adding to the pile of reasons I want a do-over on today [22:57:44] I'm about 3 snarky comments from rage-quitting for the day [22:57:57] Can we just skip .8 for group2 it made it very clear it dont want it [22:58:06] * bblack snarks [22:58:10] Zppix: Shut up [22:58:20] Ok [22:58:22] Krinkle: What bug are you working on? I’m 95% certain that T181006 was my fault, due to the ORES deployment. [22:58:37] is people reverting already? [22:58:46] awight: nothing. Just noticed HTTP 500 alerts in Graphite. And timestamps don't align with ORES, it started 20min earlier. [22:58:53] It alings with the branch going out [22:58:56] so we're reverting that [22:59:29] I just can't win. 
[22:59:34] (CR) jenkins-bot: Revert "group2 to wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/392531 (owner: Chad) [22:59:36] * no_justification runs off to mope for a bit [22:59:53] Krinkle: Sorry to not report here sooner—this was an unexpected side-effect of fixing T179711, and I’ve rolled back the ORES change which caused it... [22:59:54] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [23:00:12] now, why aren’t the errors tapering off... [23:00:20] my theory is in tatters so far [23:00:20] awight: cache? [23:00:42] Zppix: That could be. The thresholds response is cached for a day. [23:01:09] halfak: I’m gonna try to purge the thresholds cache, if I can. [23:02:12] Operations, Wikimedia-Incident: wmf.8 deploy to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776262 (Krinkle) [23:02:41] Filing a separate task for hte HTTP spike for now. [23:02:45] ^ [23:02:49] Anyone want to help me purge an object from WANObjectCache? [23:03:20] you could probably just do it from eval.php [23:03:21] Operations, Wikimedia-Incident: Deploying wmf.8 to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776267 (Zppix) [23:03:43] bawolff: +1 I’m headed thataway [23:05:54] Operations, Wikimedia-Incident: wmf.8 deploy to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776271 (Krinkle) [23:06:33] !log smalyshev@tin Started deploy [wdqs/wdqs@7d951d2]: Rollback categories vocabulary version due to a bug [23:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:26] Krinkle: I'll follow the task, but I have zero intent to attempt wmf.8 again today [23:10:49] Key is purged, let’s see if that did the job... 
[23:11:47] !log purged memcache key 'ruwiki:ORES:threshold_statistics:goodfaith:1’, T181006 [23:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:54] T181006: Watchlist and RecentChanges don't work on ruwiki - https://phabricator.wikimedia.org/T181006 [23:12:17] halfak: ouch, I don’t think the rollback worked [23:12:25] what makes ruwiki, frwiki special? [23:12:27] So both MW and ORES were rolled back [23:12:28] we’re still getting a “null” in the thresholds [23:12:38] Krinkle: It’s possible that I didn’t roll back ORES correctly [23:12:44] or is still failing [23:12:50] according to logstash [23:12:54] *ores [23:13:31] Thought: maybe the ORES extension should fail a little more gracefully when it gets a response from ORES it can't handle. Right now, it tends to fail hard and fast. Which is good for logging, but bad for my train conduction sanity. [23:13:39] This seems to have become a theme, lately [23:13:50] awight: Can you verify the code in question is indeed back to the old state on the deployment host? Eg. check manually with cat/grep or smth. [23:13:56] for ores. [23:14:23] *reads up* [23:14:32] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:14:44] !log smalyshev@tin Finished deploy [wdqs/wdqs@7d951d2]: Rollback categories vocabulary version due to a bug (duration: 08m 11s) [23:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:58] Krinkle: I can confirm that it’s *not* rolled back correctly :( [23:15:03] Retrying that now [23:15:08] no_justification: it is possible the ORES deployment is unrelated. Given this started immediately after the train. 20min before the python/ores deply [23:15:34] The number of 500s we got on Varnishes make me thing this is not jsut RC/WL views. 
[23:15:36] But need to check logs to be sure [23:15:45] Krinkle: my deploy didn’t take effect until the very end of the scap, which was approx 22:20 UTC [23:15:51] And if it is, it woudl be a regression on the MW side not the ORES side. [23:16:02] My general point about ORES having been flakey and causing lots of MW-level failures lately stands. [23:16:20] The symptoms are exactly what I’d expect for this to be a ORES failure caused specifically by the code I pushed [23:16:22] The MW shim...leaves something to be desired. [23:16:24] lemme try to roll back again [23:16:29] is the revert ongoing? [23:16:50] I’d love to chat more about Ext:ORES, data flow and failure behavior, later :) [23:16:54] jynus: MW rollback is done. I'm just being angry/ranty/snarky at this point. [23:17:09] you can be that later, after revert is done :-D [23:17:16] first we fix [23:17:21] I know fuck all about ORES' rollback. Apparently it failed? [23:17:46] no_justification: hm. wikiversions rollback was not logged to Graphite events? [23:17:47] no_justification: All that happened is, I suck and chose an incorrect SHA-1 [23:17:53] if there is no fix, we undeploy ores [23:17:58] +1 [23:18:00] but hold up [23:18:01] Krinkle: Ugh. [23:18:11] It was logged to SAL, which is what matters ;-) [23:18:20] Operations, ops-ulsfo: WMF7218 missing serial in racktables - https://phabricator.wikimedia.org/T178609#3776306 (RobH) Open>Resolved I fixed this awhile back and forgot the task existed. [23:18:22] not to https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=now-1h&to=now [23:18:32] !log awight@tin Started deploy [ores/deploy@95cd523]: Rollback ORES (take 2); 181006 [23:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:24] omg it’s fetching for some reason, rather than using cached repo data. 
That can’t happen, it takes 45min [23:19:37] * awight fails to remove panic from voice :) [23:19:56] Why should it have to fetch if it's just using an old sha1? [23:19:59] !log aborted ORES rollback [23:19:59] It should already be there. [23:20:05] no_justification: that’s what I’m saying… [23:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:08] jfc. [23:20:42] no_justification: I recorded the wrong SHA-1 for rollback, cos I had to use the same deployment dir for another cluster. [23:22:03] When this is done, I’m putting my deployment keys in the jar. [23:23:04] Krinkle: I vote removing ORES from frwiki, it seems to be the leading offender in logstash [23:23:12] awight: ^ [23:23:21] 1 minute please [23:23:28] I’d really rather roll back. [23:23:50] I would too, if it takes less than 45 minutes. [23:24:16] Per Logstash, it does seem restricted to Special:RC/RL (and their RSS feed) and restricted to those two wikis. [23:24:32] Just verifying that this therefore is likely not related to the train branch, unless caused by a PHP change in the ORES extension. [23:24:49] also the HTTP 500 has not gone down since the train rollback [23:24:49] thanks Krinkle [23:25:01] Krinkle: That was my assumption at this point. [23:25:32] !log awight@tin Started deploy [ores/deploy@82a13ae]: Rollback ORES (take 3); 181006 [23:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:59] ok that rev was cached on the servers, looks good so far. 
[23:26:24] (PS1) Jcrespo: Ores: Emergency disable on frwiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/392535 (https://phabricator.wikimedia.org/T181006) [23:27:48] Operations, Wikimedia-Incident: wmf.8 deploy to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776343 (Krinkle) [23:28:07] Operations, Wikimedia-Incident: wmf.8 deploy to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776246 (Krinkle) [23:28:13] jcrespo: that’s going to mask whether we’ve fixed the problem [23:28:31] well, at least I was trying something [23:28:50] (CR) Awight: "Please don't quite yet. I have a rollback which should fix the issues, it'll be deployed within 5 min." [mediawiki-config] - https://gerrit.wikimedia.org/r/392535 (https://phabricator.wikimedia.org/T181006) (owner: Jcrespo) [23:29:09] Operations, Wikimedia-Incident: wmf.8 deploy to group2 wikis caused HTTP 500 spike - https://phabricator.wikimedia.org/T181008#3776246 (Krinkle) Open>Invalid Closing in favour of T181006. [23:29:11] jynus: TY for preparing that patch, it might be needed yet. [23:29:30] awight: I don't think it's acceptable for RC to not be available for this long [23:29:58] I agree with jynus just disable it for now [23:30:04] I am sorry, but while I understand there was a lot of confusion [23:30:25] RC is too critical to be unavailable for this long [23:30:27] why that wasn't almost the first option- there are errors on ores, we disable ores [23:30:31] legoktm: +1, let’s add that point to T181010. 
I’ll let you decide whether to turn it off immediate [23:30:32] T181010: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010 [23:30:47] it was like that for almost 1:30 hours [23:30:56] I mean, unless you're gonna have it fixed in the next 2 minutes I think it needs to be disabled now [23:31:02] what legoktm said [23:31:12] I’m happy with that [23:31:18] I think it works now [23:31:53] jynus: I’m on 5 out of 10 servers, so it may be intermittent for another 5 min [23:32:13] but errors are still high [23:32:18] https://ru.wikipedia.org//w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%A1%D0%B2%D0%B5%D0%B6%D0%B8%D0%B5_%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8&feed=atom still exceptions currently [23:32:57] 2 minutes are up [23:33:07] (CR) Legoktm: [C: 2] Ores: Emergency disable on frwiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/392535 (https://phabricator.wikimedia.org/T181006) (owner: Jcrespo) [23:33:10] https://fr.wikipedia.org/wiki/Special:Recentchanges [23:33:19] still fatalling there too [23:33:49] only works after reloading without cache [23:33:54] legoktm: can you sync that too? I walked afk for my health for 10mins [23:33:57] yes I am [23:34:03] Tyvm [23:34:18] on fr not even that [23:34:20] (Merged) jenkins-bot: Ores: Emergency disable on frwiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/392535 (https://phabricator.wikimedia.org/T181006) (owner: Jcrespo) [23:34:29] lets disable [23:34:31] thanks [23:34:39] Yep, reload didn't help on frwiki for me either [23:34:47] both are working for me now [23:35:00] !log purge cache keys for ORES thresholds on frwiki and ruwiki [23:35:02] Yep, it's up for me [23:35:02] syncing [23:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:12] it is ok [23:35:20] is not like we cannot rever the disable easily! 
[23:35:47] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: emergency disable ORES on frwp/ruwp T181006 (duration: 00m 49s) [23:35:48] I think it was a good decision, thanks for helping by pushing on that prong of the attack [23:35:50] Confirmed both working for me [23:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:53] T181006: Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki - https://phabricator.wikimedia.org/T181006 [23:36:37] I don’t see varnish 500s going down in grafana, what’s the lag on that display? [23:36:43] 1m? [23:36:49] (CR) jenkins-bot: Ores: Emergency disable on frwiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/392535 (https://phabricator.wikimedia.org/T181006) (owner: Jcrespo) [23:36:50] maybe a bit more [23:37:13] it is already down on mediawiki [23:37:31] everything dropped to basically 0 on fatalmonitor @ 23:34:30 [23:37:57] I am checking if there is other wikis affected [23:38:13] where’s my goddamn T-shirt [23:38:19] I only tried frwiki because ruwiki is on the same shard, thinking this was the database [23:38:33] maybe other less traffic wikis were affected too [23:38:43] why only frwiki and ruwiki and not other ores ones? [23:39:13] frwiki had an impossible "threshold" (> .99 precision or something like that) [23:39:15] mmm, it seems there are enwiki errors too, related to ores [23:39:21] not sure what happened with ruwiki [23:39:28] but it doesn't show on recentchanges [23:40:04] there will always be errors. [23:40:11] Were they the same error, jynus? 
[23:40:16] yeah, those seem unrelated [23:40:21] I was just pointing things [23:40:23] (PS2) Dzahn: webperf: Add missing mediaWikiLoad to navtiming2 [puppet] - https://gerrit.wikimedia.org/r/392036 (https://phabricator.wikimedia.org/T180598) (owner: Phedenskog) [23:40:28] to give them a look [23:40:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [23:41:10] no relevant errors in the last 15 minutes [23:41:20] (CR) Dzahn: [C: 2] webperf: Add missing mediaWikiLoad to navtiming2 [puppet] - https://gerrit.wikimedia.org/r/392036 (https://phabricator.wikimedia.org/T180598) (owner: Phedenskog) [23:41:55] so the fact that there was a problem is not the issue- what I do not understad why not revert and/or disable faster? [23:42:13] I understand some things are slow or difficult to revert [23:42:26] We did the first revert did take or whatever properly [23:42:28] the exceptions were only coming from fr and ru, none from any other wikis [23:42:29] Didnt* [23:42:37] only? [23:42:46] that was like 200 errors per minute [23:42:58] alerms must have gone off almost immediately? [23:43:05] let me check the log [23:43:13] https://logstash.wikimedia.org/goto/ee5fa2166c5b004165da4634f6133d38 [23:43:19] no, ops log [23:43:21] We don't seem many 500s from ORES -- if any [23:43:43] Looks like ORES was doing what it was supposed to do and the extension wasn't ready for that ^_^ [23:43:51] PROBLEM - Long running screen/tmux on analytics1003 is CRITICAL: CRIT: Long running SCREEN process. (PID: 22631, 2782586s 1728000s). [23:43:51] ORES ext vs. 
ORES service [23:44:03] halfak: The ORES extension should be more defensive then [23:44:11] There's basically *no* reason to throw exceptions to users like that [23:44:12] 100% agreed [23:44:16] Right [23:44:22] (a general complaint of mine, I hate throwing exceptions) [23:44:31] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:44:41] [22:16:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [23:44:56] you see that, you revert [23:45:20] jynus, no sense in hunting. Let's get it in the incident report. I think awight has learned something from this already and repeating the lesson will only build resentment. [23:45:40] no_justification: I’ll cover it in T181010, but yeah we knew that Ext:ORES behavior was unknown when getting this new type of error. The reason I didn’t bother to smoke test that specifically was that we thought we were replacing a 500 error with a 404 error that *might* cause a 500 in the extension. I don’t understand exactly why that wasn’t true, yet. [23:45:40] T181010: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010 [23:45:47] this is now awight's issue [23:45:52] *not [23:46:07] awight: I'm talking *generally* in the extension, not just this instance. [23:46:09] there were other people aware on IRC [23:46:11] !log phab2001 - reboot for kernel upgrade [23:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:22] It's become a bit of a pattern: ORES (the service) returns something to ORES (the extension) that the latter can't handle [23:46:26] So the extension explodes [23:46:34] And sends us scrambling [23:46:38] ok, then we do no deploy ores [23:47:09] no_justification, right. this is a problem we're aware of. It's become something I'm uncomfortable with. 
The beta deploy seems to not be helping us catch these things. [23:47:09] jynus: I’m shaken from this, too—but that’s not helpful. [23:48:43] no_justification: jynus: Krinkle: legoktm: greg-g: I’m actually looking forward to this incident report, please put your notes in T181010 if you want me to capture and elaborate. [23:48:55] +1 ^ [23:49:07] We'll reach out to releng about how to boost our beta chops. [23:49:18] Well, betas aren't very useful without real traffic. [23:49:21] I'll be looking into why this didn't show up in beta. [23:49:23] It's hard to exercise all code paths [23:49:26] no_justification, so beta isn't usefu;l [23:49:32] We'll need to make beta useful [23:49:35] It has some uses, sure. [23:49:50] can we googlize this problem? just put a BETA logo on the main page and then outages aren't an issue at all :) [23:49:59] bblack: Best idea, obvs. [23:50:04] bblack, lol [23:50:07] halfak: In this case, yeah it’s the code path. It’s because the change had the potential to destabilize *every* wiki, but we didn’t have the config or tools to UI test every S:RL [23:50:14] halfak: But my point is that we shouldn't have to beta test for these failure modes. [23:50:15] bblack: does it mean i can break it more xD [23:50:45] grep for RuntimeException, replace with something that logs loudly, but fails nicely back to the user. [23:50:46] bblack: That’s not so unreasonable—I wouldn’t mind seeing 0.01% of real traffic go through beta. But I’m ruthless that way. [23:50:53] *Defensive* programming. [23:51:14] no, real traffic cannot be on beta due to security concerns [23:51:18] that is why it is separate [23:51:28] e.g. everybody can see beta ips [23:51:49] jynus: True, we’d need another mechanism, and it’d be super expensive. Another full cluster with some new security level. [23:52:00] what you may want is to enable production in percentages [23:52:04] that would be more what you want [23:52:06] Or maybe we find a way to anom the ips on beta from logs? 
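The "logs loudly, but fails nicely back to the user" idea raised above can be sketched roughly as follows. This is an illustration only, not the ORES extension's code (which is PHP): the response shape `{"thresholds": {model: number-or-null}}` and the names `parse_threshold` and `filter_label` are assumptions made up for this example. The one detail taken from the log is that the thresholds came back as "null".

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("ores_client_sketch")


def parse_threshold(response: Any, model: str) -> Optional[float]:
    """Return a usable threshold from a service response, or None.

    Assumed (hypothetical) response shape:
        {"thresholds": {<model>: <number or null>}}
    Anything else is logged loudly and treated as "no threshold available".
    """
    try:
        value = response["thresholds"][model]
    except (KeyError, TypeError):
        logger.error("Malformed threshold response for %s: %r", model, response)
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        logger.error("Unusable threshold for %s: %r", model, value)
        return None
    return float(value)


def filter_label(response: Any, model: str) -> str:
    """Build a filter label without ever raising to the caller."""
    threshold = parse_threshold(response, model)
    if threshold is None:
        # Degrade: show the page without this filter instead of a fatal.
        return f"{model}: unavailable"
    return f"{model}: score >= {threshold:.2f}"
```

With this shape, a null threshold turns into an error-level log entry plus a degraded label, rather than an exception on every RecentChanges or Watchlist view.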
[23:52:23] jynus: so much yes [23:52:37] what I do not understand is the stigma of undeploys disabling and rollbacks [23:52:51] nobody is sayting a code is bad for suffering that [23:52:53] jynus: I was happy to rollback, just fucked it up the first time. [23:52:59] it is just a precaution [23:53:16] This was something that was triggered on every request? Seems like a good case of should have had some sort of automated testing [23:53:21] also, rollback takes *really* long for this code because part of the deal is that we need to reinstall a bunch of python libs [23:53:27] ok [23:53:33] then you have an actionable there [23:53:44] if it is not rollback'ed fast and easy [23:53:49] it should not be deployed [23:53:55] The other place I screwed up was to not monitor the client side, only the server side [23:53:56] well another thing I'm not sure we're really great at yet (at many different levels, in ops too) is making sure each stepwise change we deploy is both independent and possible to revert easily. (in other words, forwards and reverse compatible in some deeper sense) [23:54:02] I did already list these things on the bug [23:54:05] I know it is not that easy [23:54:10] but it is not impossible [23:54:13] you have to have that to have gradual rollouts, anyways [23:54:28] but sometimes a change affects data and then the code rollback leaves things still-broken, etc [23:54:49] bblack but that is something that has to be considered on the design [23:54:49] awight: halfak one way of making beta cluster usable to test ORES would be to write automated browser tests, maybe? Or is the data need too large for that? 
[23:54:56] it's a tricky part of the problem [23:54:58] bblack: +1 The rollback included a godforsaken memcached key purge [23:55:11] that would have been pretty hard to anticipate and code for, though [23:55:15] bblack, if it is thought from the beginning, we shouldn't reach problems [23:55:24] bblack, and I am talking from the knowledge [23:55:33] I perform schema changes that take 6 month to complete [23:55:37] greg-g: I’m not sure yet, I think it would have helped but not like a % of real traffic [23:56:03] awight: was this traffic based or just bad code path? [23:56:15] (CR) Dzahn: [C: 2] "thank you! looks cool:)" [puppet] - https://gerrit.wikimedia.org/r/389498 (https://phabricator.wikimedia.org/T180498) (owner: Paladox) [23:56:17] (or both?) ;) [23:56:33] it can be a deep thing to solve though when we're talking about things like memcache data changing. or a code change that changes the compatibility of the contents of some user-data field for all users that have logged in at least once under the new code [23:56:36] (PS7) Dzahn: planet: Improve look and configuation updates [puppet] - https://gerrit.wikimedia.org/r/389498 (https://phabricator.wikimedia.org/T180498) (owner: Paladox) [23:56:38] those kinds of things are hard [23:57:17] you have to start versioning what your data means, too, independently of the code [23:57:52] greg-g: Good question, UI tests would catch some things that real traffic wouldn’t, but vice-versa. What I meant was that real traffic would definitely have caught this, but that’s not much of an argument for a huge $$ project. 
[23:57:54] (in the case where we had the rest of the infrastructure to be pushing new code to 1% of live traffic, that same issue surfaces even more-brutally) [23:58:21] awight: right right, we all want a real staging cluster (spoiler: in the works, but like next fiscal or more ;) ) [23:58:23] I mean you do not have to be so edgy, as stupid DBA disabled the code, and you cannot get dumber than being a mysql dba! [23:58:51] the rest of the people are good coders, I have confidence on them! [23:58:56] bblack: Since you mention it, we actually have that capability in place, the ORES cache in question is versioned, but incrementing it isn’t done automatically so we have to anticipate and code that in.
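The versioned-cache point can be illustrated with a small sketch. The key layout copies the key purged earlier in the log (`ruwiki:ORES:threshold_statistics:goodfaith:1`); reading the trailing `1` as the cache version is an assumption, and `CACHE_VERSION`, `threshold_cache_key` and `lookup` are hypothetical names, with a plain dict standing in for memcached.

```python
from typing import Any, Dict, Optional

# Hypothetical constant: bumped by hand whenever the meaning or shape of the
# cached value changes, so that stale entries are simply never read again.
CACHE_VERSION = 1


def threshold_cache_key(wiki: str, model: str, version: int = CACHE_VERSION) -> str:
    """Build a cache key whose last component is the data version."""
    return f"{wiki}:ORES:threshold_statistics:{model}:{version}"


def lookup(cache: Dict[str, Any], wiki: str, model: str, version: int) -> Optional[Any]:
    """Read only entries written under the given version; others are ignored."""
    return cache.get(threshold_cache_key(wiki, model, version))


# In-memory stand-in for memcached, holding a bad value written under version 1.
cache: Dict[str, Any] = {
    threshold_cache_key("ruwiki", "goodfaith", 1): {"threshold": None},
}

stale = lookup(cache, "ruwiki", "goodfaith", 1)  # the bad value is still served
fresh = lookup(cache, "ruwiki", "goodfaith", 2)  # after a bump: cache miss
```

Under these assumptions, bumping the version in the same change as the code (or in its rollback) makes readers miss the old entry and recompute, so no manual key purge is needed. The cost is the one noted above: somebody has to remember to bump it.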