[00:04:30] oauth is working good at foundationwiki https://wikimediafoundation.org/wiki/Special:RecentChanges
[00:04:44] now don't blame me for doing the maintenance :P
[00:29:37] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 396.63 seconds
[00:30:08] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 404.77 seconds
[00:40:48] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.10 seconds
[00:52:48] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.53 seconds
[00:53:18] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.00 seconds
[00:58:57] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.82 seconds
[01:05:28] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 3.30 seconds
[01:05:58] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 29.67 seconds
[01:09:58] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.78 seconds
[02:08:37] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.27 seconds
[02:08:58] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.58 seconds
[02:19:00] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[02:48:27] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 458.06 seconds
[02:48:57] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.69 seconds
[03:00:27] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.84 seconds
[03:00:58] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.22 seconds
[03:03:07] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[03:03:28] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[03:18:27] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0]
[03:28:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 817.41 seconds
[03:54:27] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 426.55 seconds
[03:54:57] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 447.42 seconds
[04:00:58] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.73 seconds
[04:01:28] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.21 seconds
[04:04:57] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 2.06 seconds
[04:05:37] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 15.33 seconds
[04:26:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 218.90 seconds
[04:32:17] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.85 seconds
[04:32:47] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.85 seconds
[04:37:50] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string Picture not found on https://commons.wikimedia.org:443/wiki/Main_Page - 25071 bytes in 0.007 second response time
[04:38:48] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 46.10 seconds
[04:40:27] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:46:59] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 86793 bytes in 0.009 second response time
[04:53:37] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 426.34 seconds
[04:54:07] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 450.63 seconds
[04:56:17] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.94 seconds
[05:00:58] A certain admin sockpuppet has deleted many highly visible pages on Commons. We rae in the progress of restoring them and may create lots of slave lag. CC DBAs jynus marostegui volans|off
[05:01:03] *are
[05:01:08] legoktm: ^
[05:01:24] :thumbsup:
[05:06:17] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[05:06:27] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.15 seconds
[05:06:47] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[05:17:21] zhuyifei1999_, are you using a script to restore it?
[05:17:30] yeah, writing one
[05:18:09] zhuyifei1999_, okay, let me know if you need help. Are you an admin?
[05:18:20] yes
[05:18:22] zhuyifei1999_, you could also use pywikibot which has that.
[05:18:26] ik
[06:01:27] (PS1) Jcrespo: mariadb: Adding rack allocations, some formatting fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459)
[06:33:56] (PS2) Jcrespo: mariadb: Adding rack allocations, some formatting fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459)
[06:34:57] (Abandoned) Jcrespo: db-readonly: Change the read only message for something generic [mediawiki-config] - https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) (owner: Jcrespo)
[06:42:54] (PS3) Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459)
[06:43:29] CI may not be working?
[06:45:26] (CR) Jcrespo: "There should be no functionality changes (in practice) on this patch" [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) (owner: Jcrespo)
[07:21:05] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3517929 (Joe) I don't think we're safe to do this maintenance until we do rack all the new mediawiki machines. We have almost half of our capacity for MediaWiki in row D. We have...
[07:21:35] (PS1) Jcrespo: dblists: Add extra instances to dbstore2001 [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409)
[07:28:39] (CR) Giuseppe Lavagetto: base::service_unit: deprecate autolookup of templates (2 comments) [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:30:52] (PS1) Jcrespo: Set OOM_adj to -500 for mariadb- WMF package is normally dedicated [software] - https://gerrit.wikimedia.org/r/371448 (https://phabricator.wikimedia.org/T172494)
[07:32:37] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused
[07:34:28] restarting --^
[07:35:08] !log restart pdfrender on scb1004
[07:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:37] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.007 second response time
[07:45:44] re-pooled mw2256, should get back to DSH soon
[07:53:37] RECOVERY - mediawiki-installation DSH group on mw2256 is OK: OK
[08:02:06] (CR) Gehel: base::service_unit: deprecate autolookup of templates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[08:34:27] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3517960 (elukey) Papaul applied the thermal paste on the CPU since it was basically not present, and send a `sos report` to DELL to get their support. I just re-pooled the host, let's see if it freezes again.
[08:36:32] Operations, ops-eqiad, Analytics-Kanban, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3517961 (elukey) Looking good: ``` [Tue Aug 8 13:45:05 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex [Tue Aug...
[08:42:30] Operations, Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3517966 (Aklapper) > who knows what these were? @Dzahn: See T79946 which mentions [[ https://en.wikipedia.org/wiki/User:Ariley | Ariley ]] and [[ https://meta.wikimedia.org/wiki/User:Wol...
[08:56:46] Operations, Puppet, User-Joe: Fix the `base::service_unit` template scoping problem - https://phabricator.wikimedia.org/T173078#3517983 (Joe)
[09:07:19] !log stopping and restarting db2046 for upgrade
[09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:31] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (elukey) Hello people, any timeline for these hosts? Don't mean to pressure, just knowing the timings to organize/schedule...
[09:12:29] Operations, Performance-Team, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3518005 (elukey) Open>Resolved a:elukey >>! In T125735#3424167, @elukey...
[09:28:25] !log stopping and restarting es2013 for upgrade
[09:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:55] (PS2) Jcrespo: dbstore_multiinstance: All hosts other than dbstore2002 will have 8 instances [puppet] - https://gerrit.wikimedia.org/r/371073 (https://phabricator.wikimedia.org/T168409)
[10:05:57] (PS1) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[10:06:32] (CR) jerkins-bot: [V: -1] mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[10:08:07] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[10:10:07] (PS2) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[10:10:37] (CR) jerkins-bot: [V: -1] mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[10:16:30] (PS3) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[10:21:12] (CR) Marostegui: dblists: Add extra instances to dbstore2001 (1 comment) [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:22:49] (PS9) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[10:23:14] (CR) jerkins-bot: [V: -1] role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[10:23:19] (PS2) Jcrespo: dblists: Add extra instances to dbstore2001 [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409)
[10:23:25] (CR) Jcrespo: "thanks" [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:23:44] (CR) Marostegui: [C: 1] dblists: Add extra instances to dbstore2001 [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:26:02] (CR) Marostegui: mariadb: Remove package hacks for MariaDB 10.1 on jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[10:29:37] (CR) Marostegui: [C: 1] mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) (owner: Jcrespo)
[10:38:07] (PS4) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[10:39:06] (PS10) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[10:41:21] (PS1) Giuseppe Lavagetto: wmflib: add init_template functions [puppet] - https://gerrit.wikimedia.org/r/371452 (https://phabricator.wikimedia.org/T173078)
[10:41:23] (PS1) Giuseppe Lavagetto: calico: convert calico-node to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371453 (https://phabricator.wikimedia.org/T173078)
[10:44:35] (CR) Jcrespo: [C: 2] Set OOM_adj to -500 for mariadb- WMF package is normally dedicated [software] - https://gerrit.wikimedia.org/r/371448 (https://phabricator.wikimedia.org/T172494) (owner: Jcrespo)
[10:44:41] (PS2) Jcrespo: Set OOM_adj to -500 for mariadb- WMF package is normally dedicated [software] - https://gerrit.wikimedia.org/r/371448 (https://phabricator.wikimedia.org/T172494)
[10:45:03] (CR) Jcrespo: [V: 2 C: 2] Set OOM_adj to -500 for mariadb- WMF package is normally dedicated [software] - https://gerrit.wikimedia.org/r/371448 (https://phabricator.wikimedia.org/T172494) (owner: Jcrespo)
[10:47:01] (CR) Jcrespo: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7403/" [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:47:14] (PS3) Jcrespo: dblists: Add extra instances to dbstore2001 [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409)
[10:49:44] (CR) Jcrespo: "I merged the wrong review, I will keep it anyway." [software] - https://gerrit.wikimedia.org/r/371447 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:50:07] (CR) Jcrespo: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7403/" [puppet] - https://gerrit.wikimedia.org/r/371073 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[11:09:19] (PS11) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[11:14:29] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:14:30] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:14:39] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:15:20] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[11:16:29] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 4.005 second response time
[11:19:39] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:39] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:56] !log Stop replication on db2046 to fix duplicate entries - T151029
[11:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:08] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029
[11:21:29] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[11:21:29] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 1.446 second response time
[11:22:29] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[11:25:07] !log stopping and upgrading labsdb1010
[11:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:49] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:25:54] (CR) Marostegui: [C: 1] mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[11:26:39] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[11:28:24] 2 proxies should complain soon
[11:30:10] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:30:49] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:31:53] PROBLEM - mysqld processes on labsdb1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[11:32:11] mmm
[11:32:19] that shouldn't have happend
[11:33:46] Just checking in on labsdb
[11:33:54] don't
[11:33:57] see my log
[11:34:23] mysqld process checking shouldn't be critical
[11:34:38] when there is automatic failover
[11:35:27] K, tx jynus
[11:38:20] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0
[11:38:49] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[11:49:04] (CR) Elukey: "pcc looks good for the first time: https://puppet-compiler.wmflabs.org/compiler02/7408/" [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[11:55:38] (PS1) Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/371456
[11:55:42] (PS2) Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/371456
[12:02:36] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/371456 (owner: Marostegui)
[12:04:09] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/371456 (owner: Marostegui)
[12:05:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2046 - T151029 (duration: 00m 48s)
[12:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:46] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029
[12:05:49] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/371456 (owner: Marostegui)
[12:26:33] (PS11) Elukey: role:eventubus: set deploy-service as scap deploy_user [puppet] - https://gerrit.wikimedia.org/r/371014 (https://phabricator.wikimedia.org/T171506)
[12:33:01] (CR) Elukey: [C: 2] role:eventubus: set deploy-service as scap deploy_user [puppet] - https://gerrit.wikimedia.org/r/371014 (https://phabricator.wikimedia.org/T171506) (owner: Elukey)
[12:44:14] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[12:44:19] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 05s)
[12:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:41] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[12:51:45] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 04s)
[12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:31] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:49] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 17s)
[13:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:09] (PS1) Elukey: role:eventbus: add conftool pool/depool cred to scap [puppet] - https://gerrit.wikimedia.org/r/371466 (https://phabricator.wikimedia.org/T171506)
[13:07:11] Operations, MediaWiki-extensions-FlaggedRevs: Applying pending changes protection and extended confirmed users in idwiki - https://phabricator.wikimedia.org/T172838#3518275 (Kenrick95)
[13:09:03] (CR) Elukey: [C: 2] role:eventbus: add conftool pool/depool cred to scap [puppet] - https://gerrit.wikimedia.org/r/371466 (https://phabricator.wikimedia.org/T171506) (owner: Elukey)
[13:15:49] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[13:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:01] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 12s)
[13:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:54] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[13:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:28] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 34s)
[13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:37] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[13:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:50] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 13s)
[13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:02] !log mobrovac@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided)
[13:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:29] !log mobrovac@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 27s)
[13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:23] !log moved the eventbus scap deployment dirs on kafka[12]00[123] to deploy-service:deploy-service to allow scap to depool/pool - T171506
[13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:34] T171506: eventlogging-service-eventbus scap deployments should depool/pool during deployment - https://phabricator.wikimedia.org/T171506
[13:51:35] ottomata: --^ \o/
[13:52:18] :) !yeehaw
[13:59:21] marostegui: could you please investigate the fatal ref. WY2yRApAEKcAABSE9XgAAAAC ?
[13:59:32] Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"
[14:00:44] 2017-08-11 13:34:37 [WY2yRApAEKcAABSE9XgAAAAC] mw1212 metawiki 1.30.0-wmf.13 exception ERROR: [WY2yRApAEKcAABSE9XgAAAAC] /wiki/Special:NotifyTranslators Wikimedia\Rdbms\DBTransactionSizeError from line 1177 of /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 5.2927534580231 second(s) in writes, exceeding the 3 limit. {"exception_id":"WY2yRApAEKcAABSE9XgAAAAC","caught_by":"
[14:00:44] mwe_handler"}
[14:00:49] TabbyCat: Do you want a stack trace somewhere?
[14:01:13] Reedy: https://phabricator.wikimedia.org/T160276 would be a good place
[14:01:45] done
[14:02:26] thanks
[14:04:56] Operations, ops-eqiad: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3518439 (Cmjohnson)
[14:10:00] Operations, ops-eqiad, Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3518454 (elukey)
[14:10:49] PROBLEM - HTTPS on netmon2001 is CRITICAL: SSL CRITICAL - Certificate librenms.wikimedia.org valid until 2017-08-14 14:10:00 +0000 (expires in 2 days)
[14:13:10] (PS1) Giuseppe Lavagetto: cassandra: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371478 (https://phabricator.wikimedia.org/T173078)
[14:13:11] (PS1) Giuseppe Lavagetto: celery: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371479 (https://phabricator.wikimedia.org/T173078)
[14:13:13] (PS1) Giuseppe Lavagetto: confluent: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371480 (https://phabricator.wikimedia.org/T173078)
[14:13:15] (PS1) Giuseppe Lavagetto: systemd::service: convert a bunch of modules to it [puppet] - https://gerrit.wikimedia.org/r/371481 (https://phabricator.wikimedia.org/T173078)
[14:13:17] (PS1) Giuseppe Lavagetto: prometheus: convert to systemd::service where needed [puppet] - https://gerrit.wikimedia.org/r/371482
[14:15:24] (PS2) Giuseppe Lavagetto: wmflib: add init_template functions [puppet] - https://gerrit.wikimedia.org/r/371452 (https://phabricator.wikimedia.org/T173078)
[14:17:37] (CR) Giuseppe Lavagetto: [V: 2 C: 2] wmflib: add init_template functions [puppet] - https://gerrit.wikimedia.org/r/371452 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[14:18:15] (PS2) Giuseppe Lavagetto: calico: convert calico-node to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371453 (https://phabricator.wikimedia.org/T173078)
[14:19:22] (CR) Giuseppe Lavagetto: [C: 2] calico: convert calico-node to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371453 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[14:22:43] (PS4) Elukey: Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712)
[14:22:52] (PS12) Paladox: gerrit: DO NOT MERGE [software/gerrit] - https://gerrit.wikimedia.org/r/363738
[14:22:54] (PS11) Paladox: Gerrit: Upgrading gerrit to 2.14.3-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734
[14:27:02] (CR) Elukey: [C: 2] Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712) (owner: Elukey)
[14:27:09] PROBLEM - Host stat1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:27:29] I'll ack this
[14:33:25] (PS1) Elukey: Remove stat1002 from puppet as part of decom process [puppet] - https://gerrit.wikimedia.org/r/371486
[14:33:56] (CR) jerkins-bot: [V: -1] Remove stat1002 from puppet as part of decom process [puppet] - https://gerrit.wikimedia.org/r/371486 (owner: Elukey)
[14:36:03] (PS2) Elukey: Remove stat1002 from puppet as part of decom process [puppet] - https://gerrit.wikimedia.org/r/371486
[14:36:29] (CR) jerkins-bot: [V: -1] Remove stat1002 from puppet as part of decom process [puppet] - https://gerrit.wikimedia.org/r/371486 (owner: Elukey)
[14:37:13] Operations, Analytics, hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (elukey)
[14:37:39] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/analytics-wmde/graphite]
[14:39:05] this is probably me --^
[14:39:43] (PS2) Giuseppe Lavagetto: cassandra: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371478 (https://phabricator.wikimedia.org/T173078)
[14:39:58] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/7414/restbase2001.codfw.wmnet/ seems ok" [puppet] - https://gerrit.wikimedia.org/r/371478 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[14:42:03] (CR) Giuseppe Lavagetto: [C: 2] cassandra: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371478 (https://phabricator.wikimedia.org/T173078) (owner: Giuseppe Lavagetto)
[14:45:26] (PS1) Elukey: statistics: re-add working_path variable [puppet] - https://gerrit.wikimedia.org/r/371487 (https://phabricator.wikimedia.org/T152712)
[14:47:26] (CR) Elukey: [C: 2] statistics: re-add working_path variable [puppet] - https://gerrit.wikimedia.org/r/371487 (https://phabricator.wikimedia.org/T152712) (owner: Elukey)
[14:49:49] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[14:51:13] (PS2) Giuseppe Lavagetto: celery: convert to systemd::service [puppet] - https://gerrit.wikimedia.org/r/371479 (https://phabricator.wikimedia.org/T173078)
[14:51:17] Bug: value must be a single phabricator task ID ?
[14:52:15] that's weird [14:52:15] elukey the task you linked here https://gerrit.wikimedia.org/r/#/c/371486/ needs to have a T [14:52:16] :) [14:52:36] otherwise it's a bugzilla task [14:52:43] * elukey cries in a corner [14:52:44] (03PS3) 10Elukey: Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 [14:53:02] thanks! [14:53:10] (03CR) 10jerkins-bot: [V: 04-1] Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (owner: 10Elukey) [14:53:12] ok another -1 incoming [14:53:14] there you go [14:53:21] (03PS4) 10Elukey: Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T368612) [14:53:48] (03CR) 10jerkins-bot: [V: 04-1] Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T368612) (owner: 10Elukey) [14:53:51] elukey thanks, also the task does not exist https://phabricator.wikimedia.org/T368612 [14:55:36] yes yes and a blank line, I blame friday evening [14:55:38] :) [14:56:06] (03PS5) 10Elukey: Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T152712) [15:01:07] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7417/" [puppet] - 10https://gerrit.wikimedia.org/r/371479 (https://phabricator.wikimedia.org/T173078) (owner: 10Giuseppe Lavagetto) [15:01:39] 10Operations, 10Analytics, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518544 (10elukey) [15:10:25] (03CR) 10Elukey: [C: 032] Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [15:10:29] (03PS6) 10Elukey: Remove stat1002 from puppet as part of decom 
process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T152712) [15:10:32] (03CR) 10Elukey: [V: 032 C: 032] Remove stat1002 from puppet as part of decom process [puppet] - 10https://gerrit.wikimedia.org/r/371486 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [15:11:53] 10Operations, 10Analytics, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518586 (10elukey) [15:13:48] 10Operations, 10Analytics, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (10elukey) Just removed all the puppet references of stat1002 and disabled alarms. Please sync with Chris and check https://phabricator.wikimedia.org/T173094 before proceeding... [15:17:09] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7418/" [puppet] - 10https://gerrit.wikimedia.org/r/371480 (https://phabricator.wikimedia.org/T173078) (owner: 10Giuseppe Lavagetto) [15:17:14] <_joe_> elukey: I'm about to merge ^^ [15:18:27] (03PS2) 10Giuseppe Lavagetto: confluent: convert to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/371480 (https://phabricator.wikimedia.org/T173078) [15:18:30] <_joe_> I've verified it is a noop, still a heads-up [15:22:28] <_joe_> and of course dependency hell! [15:22:31] <_joe_> I'll fix it [15:24:14] ok! [15:24:29] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:24:45] (03PS1) 10Jcrespo: mariadb: Depool db2075 for cloning to dbstore2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371491 (https://phabricator.wikimedia.org/T168409) [15:25:00] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:25:09] <_joe_> that is me ^^ [15:25:09] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:10] (PS1) Giuseppe Lavagetto: confluent: fixup for I161e43a2 [puppet] - https://gerrit.wikimedia.org/r/371492 [15:25:14] <_joe_> but just 2 hosts [15:25:16] <_joe_> 1 per cluster [15:26:01] (CR) Giuseppe Lavagetto: [C: 2] confluent: fixup for I161e43a2 [puppet] - https://gerrit.wikimedia.org/r/371492 (owner: Giuseppe Lavagetto) [15:28:09] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:29:00] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:30:10] <_joe_> and... done [15:31:14] Operations, fundraising-tech-ops, netops, Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3518627 (ayounsi) >>! In T171970#3510082, @ayounsi wrote: > https://librenms.wikimedia.org/device/device=153/tab=port/port=13330/ (possibly MTU related) > Possible... 
[15:35:11] (CR) Jcrespo: [C: 2] mariadb: Depool db2075 for cloning to dbstore2001 [mediawiki-config] - https://gerrit.wikimedia.org/r/371491 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo) [15:36:44] (Merged) jenkins-bot: mariadb: Depool db2075 for cloning to dbstore2001 [mediawiki-config] - https://gerrit.wikimedia.org/r/371491 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo) [15:36:55] (CR) jenkins-bot: mariadb: Depool db2075 for cloning to dbstore2001 [mediawiki-config] - https://gerrit.wikimedia.org/r/371491 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo) [15:38:39] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2040827 [15:38:49] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2075 (duration: 00m 48s) [15:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:02] !log installing git security updates on trusty (jessie/stretch already fixed) [15:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:06] !log stopping db2075 to clone it to dbstore2001 [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:39] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:07:04] (PS1) Jcrespo: prometheus-mysqld-exporter: Add s5 to the dbstore2001 monitored hosts [puppet] - https://gerrit.wikimedia.org/r/371494 (https://phabricator.wikimedia.org/T168409) [16:21:45] !log stop db1069:s6 replication and dropping frwiki, jawiki, ruwiki [16:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:33] (PS1) Jcrespo: Revert "mariadb: Depool db2075 for cloning to dbstore2001" [mediawiki-config] - https://gerrit.wikimedia.org/r/371497 [16:34:59] (PS7) Paladox: mysql: Fix installing package on stretch [puppet] - 
https://gerrit.wikimedia.org/r/354131 [16:35:27] (CR) jerkins-bot: [V: -1] mysql: Fix installing package on stretch [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox) [16:36:35] (PS8) Paladox: mysql: Fix installing package on stretch [puppet] - https://gerrit.wikimedia.org/r/354131 [16:47:49] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2051344 [16:59:45] (PS1) Jprorama: Add cross reference to k8s admin guide in wikitech [puppet] - https://gerrit.wikimedia.org/r/371501 [17:01:31] (PS2) EBernhardson: Switch elastic1017 to LVM [puppet] - https://gerrit.wikimedia.org/r/371210 (https://phabricator.wikimedia.org/T169498) [17:04:48] (PS2) Jcrespo: prometheus-mysqld-exporter: Add s5 to the dbstore2001 monitored hosts [puppet] - https://gerrit.wikimedia.org/r/371494 (https://phabricator.wikimedia.org/T168409) [17:05:04] !log Deploying phabricator security update [17:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:45] (CR) Jcrespo: [C: 2] prometheus-mysqld-exporter: Add s5 to the dbstore2001 monitored hosts [puppet] - https://gerrit.wikimedia.org/r/371494 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo) [17:15:45] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db2075 for cloning to dbstore2001" [mediawiki-config] - https://gerrit.wikimedia.org/r/371497 (owner: Jcrespo) [17:17:13] (Merged) jenkins-bot: Revert "mariadb: Depool db2075 for cloning to dbstore2001" [mediawiki-config] - https://gerrit.wikimedia.org/r/371497 (owner: Jcrespo) [17:17:23] (CR) jenkins-bot: Revert "mariadb: Depool db2075 for cloning to dbstore2001" [mediawiki-config] - https://gerrit.wikimedia.org/r/371497 (owner: Jcrespo) [17:19:06] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2075 (duration: 00m 47s) [17:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:49] 
PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2007111 [17:35:25] (PS4) Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) [18:28:46] !log installing subversion security updates [18:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:29] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:30:29] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [18:31:40] ^ was just me taking some heap dumps which paused a few es servers .. [18:34:19] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:34:20] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] [18:35:14] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3519246 (ayounsi) [18:35:16] Operations, ops-eqiad, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3519245 (ayounsi) [18:36:10] same, it was a brief blip and is already back to normal [18:37:11] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499267 (ayounsi) Open>stalled Sounds fair :) Marking this task as a dependency of T165519. Any idea of the time-line for T165519? 
[18:37:40] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3519249 (ayounsi) [18:38:09] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:40:29] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [18:40:29] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [19:07:29] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:15:26] (PS1) TheDJ: Add Timeless skin to test and mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) [19:21:10] (CR) BryanDavis: Add Timeless skin to test and mediawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) (owner: TheDJ) [19:30:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [19:42:32] (PS2) Legoktm: Add Timeless skin to test and mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) (owner: TheDJ) [19:44:00] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [19:45:59] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): 
/{domain}/v1/translation/articles/{source}{/seed} (bad seed) timed out before a response was received [19:46:09] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [19:46:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [19:46:49] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [19:46:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [19:46:59] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [19:47:09] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [19:47:29] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [19:47:49] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [19:49:29] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [19:49:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [19:50:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [19:51:19] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: 
/{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [19:52:09] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [19:59:52] !log ban elastic1017 from eqiad search cluster [20:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:57] should recover momentarily [20:06:23] (1017 is the one server i havn't applied the partial io fix on, so i still had one completely misbehaving server to test against) [20:09:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:14:34] Isarra and I are going to begin deploying the Timeless skin [20:14:48] :D [20:15:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:16:07] there's opsen in the room next to them, for the record [20:17:45] ok yeah [20:17:49] I have access to all the normal things again [20:19:17] !log varnish backend restart on cp1049 + cp1074 (mailbox lag) [20:19:24] Operations, Pybal, Traffic: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#3519518 (ema) >>! In T82849#3503596, @ema wrote: > A more general patch has been submitted by Julian Anastasov http://archive.linuxvirtualserver.org/html/lvs-devel/2017-08/m... 
[20:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:59] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [20:28:49] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [20:31:10] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:31:10] (CR) Legoktm: [C: 2] Add Timeless skin to test and mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) (owner: TheDJ) [20:33:08] (Merged) jenkins-bot: Add Timeless skin to test and mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) (owner: TheDJ) [20:33:21] (CR) jenkins-bot: Add Timeless skin to test and mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371519 (https://phabricator.wikimedia.org/T154371) (owner: TheDJ) [20:35:26] (CR) Ema: [C: 2] Add metric pybal_service_depool_threshold [debs/pybal] - https://gerrit.wikimedia.org/r/371185 (https://phabricator.wikimedia.org/T171710) (owner: Mark Bergsma) [20:36:55] !log legoktm@tin Started scap: Deploying Timeless - T154371 [20:37:01] no one look at the day of the week [20:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:08] T154371: Review and deploy Timeless skin - https://phabricator.wikimedia.org/T154371 [20:37:10] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:40:26] Operations, Analytics, hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3519593 (RobH) [20:40:50] (PS1) Filippo Giunchedi: hieradata: create pagecompilation account [puppet] - https://gerrit.wikimedia.org/r/371579 
(https://phabricator.wikimedia.org/T172123) [20:41:57] greg-g, yay for deploying past 4PM on a Friday :) [20:42:02] Operations, Analytics, hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (RobH) [20:42:05] !log legoktm@tin scap aborted: Deploying Timeless - T154371 (duration: 05m 10s) [20:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:17] T154371: Review and deploy Timeless skin - https://phabricator.wikimedia.org/T154371 [20:42:45] Krenair: it's only 1pm in SF! ;) [20:42:59] (well, almost 2, but whatever) [20:44:33] Krenair: And it's still morning in Hawai'i. [20:46:00] people are already drunk in europe, though [20:46:01] !log legoktm@tin Started scap: Deploying Timeless (try 2) - T154371 [20:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:12] (PS1) Filippo Giunchedi: profile: fix udev reload dependency for swift::storage::labs [puppet] - https://gerrit.wikimedia.org/r/371582 [20:52:40] godog: where are you guys? [20:52:52] !log varnish backend restart on cp1099 [20:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:17] win 14 [20:53:19] grr [20:53:30] mark: hackathon room [20:55:26] * godog nods [20:56:00] wheres that? [20:56:23] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3510136 (RobH) So the raid0 device to disk is not a 1:1 mapping, so while VD2 (raid0 of a single disk) has failed, its actually the HDD is slot 1: ``` Enclosure Device ID: 32 Slot Num... 
[20:57:38] mark: floor 3 room 7 (behind the elevators) [20:58:10] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3510136 (RobH) a:Cmjohnson [21:02:25] !log unban elastic1017 from elasticsearch cluster [21:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:14] Operations, Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#3519701 (RobH) I'd think we want to push more of them to raid1-lvm-ext4-srv-noswap.cfg. The only difference between that and raid1-lvm-ext4-srv.cfg is the use of a swap file. I'd suggest we eliminat... [21:06:34] Operations, Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#3519702 (RobH) If we are fine with that, I'm happy to re-image these hosts! [21:07:55] AaronSchulz, DELETE FROM globalimagelinks WHERE gil_wiki='ukwikimedia' LIMIT 500 [21:10:18] bblack: godog where'd ya'll go? Main room? the deploy is about done (they had to restart) [21:11:58] MaxSem: lgtm, several decent indexes too [21:23:59] PROBLEM - MD RAID on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:00] PROBLEM - Check systemd state on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:10] PROBLEM - DPKG on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:19] PROBLEM - puppet last run on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:19] PROBLEM - nutcracker process on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:19] PROBLEM - HHVM processes on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:19] PROBLEM - salt-minion processes on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:24:20] PROBLEM - SSH on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:24:29] PROBLEM - dhclient process on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:39] PROBLEM - configured eth on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:24:49] PROBLEM - Check whether ferm is active by checking the default input chain on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:25:23] legoktm, ^ did it break? [21:25:24] Operations: mw2256 is down - https://phabricator.wikimedia.org/T173148#3519732 (Legoktm) [21:25:52] guess so [21:33:50] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3519783 (MoritzMuehlenhoff) Crashed again, see T173148 [21:34:28] Operations: mw2256 is down - https://phabricator.wikimedia.org/T173148#3519732 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Tracked via T163346 [21:37:54] moritzm: can you depool it? [21:39:39] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2256 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:40:11] !log legoktm@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.unknown-but-probably-mediawiki.lock"; owner is "legoktm"; reason is "Deploying Timeless (try 2) - T154371" (duration: 00m 00s) [21:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:24] T154371: Review and deploy Timeless skin - https://phabricator.wikimedia.org/T154371 [21:40:32] !log legoktm@tin Started scap: (no justification provided) [21:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:02] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw2256.codfw.wmnet [21:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:11] legoktm: yep, done [21:43:52] Operations, Traffic, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3519793 (RobH) >>! In T148131#2920097, @BBlack wrote: > These are now deployed (digicert in esams, globalsign elsewhere). Pending closing this until we document switching off either of... 
[21:45:34] legoktm, so [21:45:40] scap failed, but [21:45:46] it seems to have deployed anyway [21:46:19] Krenair: it was nearly done but I tried to kill the stuck ssh on mw2256 and it accidentally killed scap too [21:46:23] so now I'm rescapping [21:46:26] :D [21:46:28] ok [21:46:45] ah, I was wondering what happened there :) [21:47:42] !log legoktm@tin Finished scap: (no justification provided) (duration: 07m 09s) [21:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:04] DONE [21:55:39] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [22:06:57] Operations, Ops-Access-Requests: Requesting @ops in#wikimedia-tech for Luke081515 - https://phabricator.wikimedia.org/T172793#3519816 (RobH) p:Triage>Normal [22:07:05] Operations, Ops-Access-Requests: Requesting @ops in #wikimedia-tech for Luke081515 - https://phabricator.wikimedia.org/T172793#3509528 (RobH) [22:13:02] Operations, Traffic: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3519829 (RobH) [22:51:25] Operations, Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3519888 (Dzahn) a:Dzahn