[00:01:09] 10Operations, 10Cloud-Services, 10RESTBase, 10Services, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1358396 (10Krinkle) @GWicke At which point was wikimedia.org (or www.wikimedia.org?) a wiki? Assuming this would've been the foundation wiki (cur... [00:02:04] Krinkle: I will talk about it with Lydia and do whatever she says [00:05:18] I need to go [00:05:19] o/ [00:31:09] (03PS1) 10Dzahn: rsync/releases: add dest_host parameter, only include what's needed [puppet] - 10https://gerrit.wikimedia.org/r/363105 (https://phabricator.wikimedia.org/T164030) [00:35:48] (03PS2) 10Dzahn: rsync/releases: add dest_host parameter, only include what's needed [puppet] - 10https://gerrit.wikimedia.org/r/363105 (https://phabricator.wikimedia.org/T164030) [00:38:51] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6922/" [puppet] - 10https://gerrit.wikimedia.org/r/363105 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [00:38:57] (03PS3) 10Dzahn: rsync/releases: add dest_host parameter, only include what's needed [puppet] - 10https://gerrit.wikimedia.org/r/363105 (https://phabricator.wikimedia.org/T164030) [00:46:37] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 394.86 seconds [00:46:37] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 392.87 seconds [00:46:47] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 390.46 seconds [00:46:57] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.41 seconds [00:47:07] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.38 seconds [00:47:17] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 401.24 seconds [00:53:15] (03PS1) 10Dzahn: rsync::quickdatacopy: fix commandline for 
cron, don't need file_path [puppet] - 10https://gerrit.wikimedia.org/r/363106 (https://phabricator.wikimedia.org/T164030) [00:53:33] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: fix commandline for cron, don't need file_path [puppet] - 10https://gerrit.wikimedia.org/r/363106 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [00:53:51] (03PS2) 10Dzahn: rsync::quickdatacopy: fix commandline for cron, don't need file_path [puppet] - 10https://gerrit.wikimedia.org/r/363106 (https://phabricator.wikimedia.org/T164030) [00:56:01] (03CR) 10Dzahn: [C: 032] rsync::quickdatacopy: fix commandline for cron, don't need file_path [puppet] - 10https://gerrit.wikimedia.org/r/363106 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [00:56:07] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 27.53 seconds [00:56:17] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [00:56:37] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [00:56:37] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [00:56:47] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [00:56:57] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.66 seconds [01:02:07] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1499130124 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9119385 keys, up 2 minutes 2 seconds - replication_delay is 1499130124 [01:02:07] PROBLEM - Check health of redis instance on 6381 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:02:17] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:02:37] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:02:47] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:03:07] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9113212 keys, up 3 minutes 1 seconds - replication_delay is 0 [01:03:07] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9110242 keys, up 3 minutes 1 seconds - replication_delay is 0 [01:03:08] RECOVERY - Check health of redis instance on 6381 on rdb1004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9015990 keys, up 3 minutes 2 seconds [01:03:27] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9016038 keys, up 3 minutes 24 seconds - replication_delay is 0 [01:03:37] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9108191 keys, up 3 minutes 29 seconds - replication_delay is 0 [01:06:05] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3403069 (10Dzahn) release files are now auto-rsynced: ``` [releases1001:~] $ sudo crontab -l | grep releases *... [01:19:20] (03CR) 10Kaldari: [C: 031] "Please schedule for SWAT deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [01:28:11] 10Operations, 10Cloud-Services, 10RESTBase, 10Services, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#3403100 (10GWicke) @krinkle, your comment sounds like it might have been intended for {T133178}. [01:30:02] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#2224310 (10GWicke) @krinkle, I don't know the exact time period www.wikimedia.org was active as a wiki, and yes it was the foundation wiki. II... [01:30:26] !log releases1001: switching GID of reprepro and prometheus-node-exporter group (1000 vs 1001), changing reprepro UID to 13927. using find -exec to fix all the permissions and make it identical to bromine. prevent permissions snafu when rsyncing (T164030) [01:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:39] T164030: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030 [01:36:25] (03PS1) 10Dzahn: releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) [01:36:48] (03CR) 10jerkins-bot: [V: 04-1] releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) (owner: 10Dzahn) [01:37:11] (03PS2) 10Dzahn: releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) [01:37:33] (03CR) 10jerkins-bot: [V: 04-1] releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) (owner: 10Dzahn) [01:38:04] 
(03PS3) 10Dzahn: releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) [01:38:25] (03CR) 10jerkins-bot: [V: 04-1] releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) (owner: 10Dzahn) [01:38:29] (03PS4) 10Dzahn: releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) [01:41:00] (03CR) 10Dzahn: [C: 032] releases: add releasers-mobile admin group to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/363107 (https://phabricator.wikimedia.org/T164040) (owner: 10Dzahn) [01:43:31] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3403113 (10Dzahn) permission issue fixed ^ , looks like this, just like on bromine, and also stays like that afte... 
[01:45:00] (03CR) 10Dzahn: "maybe i should add a warning in the comments that it needs PHP 7.1 and then we merge" [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [01:50:37] (03PS2) 10Dzahn: wikimania_scholarships: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362137 [01:51:42] (03CR) 10Dzahn: wikimania_scholarships: add support for stretch and PHP7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/362137 (owner: 10Dzahn) [01:53:23] (03PS3) 10Dzahn: wikimania_scholarships: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362137 [01:59:04] (03PS4) 10Dzahn: wikimania_scholarships: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362137 [02:02:23] 10Operations, 10MediaWiki-JobQueue, 10Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#3403134 (10Krinkle) [02:04:19] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3403138 (10Aklapper) [02:09:40] (03CR) 10Dzahn: "modules/service/manifests/deploy/scap.pp seems to create this user but says "# Deprecated: you should use scap::target directly instead of" [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [02:16:05] (03CR) 10Dzahn: [C: 04-1] "ah, here you go: "ores::web" uses "service::uwsgi" with $deployment set to scap3. service::uwsgi uses scap::target in that case. 
And that " [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [02:18:46] (03CR) 10Dzahn: [C: 04-1] "can you make a copy of role::scb and just remove all the other non-ores roles inside that while keeping standard/base::firewall and ::prof" [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [02:19:17] (03CR) 10Dzahn: [C: 032] wikimania_scholarships: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362137 (owner: 10Dzahn) [02:23:04] (03CR) 10Dzahn: "i'm not sure i fully understand this but i see that 20after4 accepted the revision on https://phabricator.wikimedia.org/D702 | adding him" [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [02:25:28] (03CR) 10Dzahn: [C: 031] "since we don't really use drafts, this _seems_ right" [puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [02:27:36] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 10m 14s) [02:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:37] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 373.27 seconds [02:38:47] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.73 seconds [02:38:48] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.75 seconds [02:39:07] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 392.07 seconds [02:39:08] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 392.99 seconds [02:39:17] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 396.61 seconds [02:47:58] 10Operations, 10hardware-requests: Reclaim/Decommission subra/suhail - 
https://phabricator.wikimedia.org/T169506#3403200 (10Dzahn) a:03Dzahn [02:48:05] 10Operations, 10hardware-requests: Reclaim/Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3400130 (10Dzahn) p:05Triage>03Normal [02:48:38] (03PS1) 10Dzahn: decom subra and suhail [puppet] - 10https://gerrit.wikimedia.org/r/363110 (https://phabricator.wikimedia.org/T169506) [02:48:47] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.64 seconds [02:48:47] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.65 seconds [02:49:07] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.45 seconds [02:49:17] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.17 seconds [02:50:27] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.75 seconds [02:50:37] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.52 seconds [02:50:47] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.10 seconds [02:50:47] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.16 seconds [02:51:07] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.49 seconds [02:51:17] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.37 seconds [02:55:48] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 36.23 seconds [02:55:48] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 36.30 seconds [02:56:07] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.40 seconds [02:56:17] (03CR) 1020after4: [C: 031] "dzahn: The various url patterns in this 
config section control the destination of various (diffusion) links within the gerrit ui. This cha" [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [02:56:17] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [02:56:27] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [02:56:37] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:59:35] (03PS1) 10Krinkle: grafana: Add legend to dashboard varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/363111 [02:59:57] (03CR) 10Krinkle: "Preview at https://grafana-admin.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes" [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [04:42:37] 10Operations, 10Pybal, 10Traffic, 10User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#3403250 (10ema) Twisted uses `ThreadedResolver` (internet/base.py) as the default resolver (I got sidetracked for a while and thought that ResolverBase in names/common.py was the one, b... [04:47:06] (03CR) 10Niharika29: "This is stalled on https://gerrit.wikimedia.org/r/#/c/362320/ which is in turn stalled on CA fixes. But it's probably fine if you override" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [04:52:08] (03CR) 10Chad: [C: 031] "I introduced it in Gerrit 2.7, but it seems to have been removed at some point." 
[puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [05:08:50] !log Deploy alter table on s2 directly on s2 master (db1054) - T168661 [05:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:01] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:10:34] (03PS4) 10Ema: VCL: remove disableImages handling [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) [05:11:01] (03CR) 10Ema: [V: 032 C: 032] "> I initially said 1 month due to HTTP caching, however if hashing" [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [05:15:09] (03PS2) 10Marostegui: site.pp: Add db1102 sanitarium role [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) [05:20:58] !log Deploy alter table on s6 directly on s6 master (db1061) - T168661 [05:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:08] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:26:36] !log Deploy alter table on s5 directly on s5 master (db1063) - T168661 [05:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:48] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:29:29] (03CR) 10Marostegui: [C: 032] site.pp: Add db1102 sanitarium role [puppet] - 10https://gerrit.wikimedia.org/r/362996 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [05:39:25] (03PS1) 10Marostegui: db1060.yaml: Change to ROW binlog format [puppet] - 10https://gerrit.wikimedia.org/r/363122 (https://phabricator.wikimedia.org/T153743) [05:40:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363123 (https://phabricator.wikimedia.org/T153743) 
[05:42:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363123 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [05:43:13] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363123 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [05:43:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363123 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [05:43:47] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/6923/" [puppet] - 10https://gerrit.wikimedia.org/r/363122 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [05:44:24] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:45:42] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [05:47:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T153743 (duration: 02m 51s) [05:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:21] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [06:05:52] !log Stop MySQL on db1060 for maintenance - T153743 [06:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:03] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [06:08:06] (03CR) 10Marostegui: [C: 032] db1060.yaml: Change to ROW binlog format [puppet] - 10https://gerrit.wikimedia.org/r/363122 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [06:11:22] 10Operations, 10DNS, 10Pybal, 10Traffic, 10Patch-For-Review: pybal DNS lookup issues causing outage risks - 
https://phabricator.wikimedia.org/T103921#1402514 (10ema) >>! In T103921#1417034, @mark wrote: > PyBal resolves IPs once at startup for managing the IPVS state. Did we neglect to extend that to Idl... [06:44:52] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [06:53:03] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1206.60 Read Requests/Sec=2286.40 Write Requests/Sec=27.20 KBytes Read/Sec=36683.60 KBytes_Written/Sec=3174.80 [06:54:04] 10Operations, 10Prometheus-metrics-monitoring: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3403317 (10ema) [06:54:26] 10Operations, 10Traffic, 10Prometheus-metrics-monitoring: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3403331 (10ema) [06:57:15] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3403334 (10MoritzMuehlenhoff) That just addressed a hardware error, this should continue to be an API server as before. [07:01:12] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=22.00 Read Requests/Sec=0.80 Write Requests/Sec=1.30 KBytes Read/Sec=42.80 KBytes_Written/Sec=31.60 [07:11:34] (03CR) 10WMDE-leszek: "> Do you want this to be deployed?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [07:15:50] !log Deploy alter table on s3 hosts (eqiad) - T168661 [07:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:01] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [07:18:56] 10Operations, 10Pybal, 10Traffic, 10User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#3403354 (10MoritzMuehlenhoff) If we use rotate in resolv.conf, I guess we'd need to couple it with "options timeout:1", otherwise it wouldn't be very effective with the default timeout... [07:22:52] (03CR) 10Muehlenhoff: [C: 031] decom subra and suhail [puppet] - 10https://gerrit.wikimedia.org/r/363110 (https://phabricator.wikimedia.org/T169506) (owner: 10Dzahn) [07:23:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:24:36] seems another spike like yesterday --^ [07:24:48] already recovered afaics [07:24:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:28:09] a lot of codfw ints (and also ulsfo, but should be because of codfw IIUC) [07:30:19] Cc: ema,bblack [07:30:19] (03PS1) 10Muehlenhoff: Extend account dates for two frtech consultants [puppet] - 10https://gerrit.wikimedia.org/r/363128 [07:31:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:32:20] (03CR) 10Muehlenhoff: [C: 032] Extend account dates for two frtech consultants [puppet] - 10https://gerrit.wikimedia.org/r/363128 (owner: 10Muehlenhoff) [07:32:52] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:42:33] !log rebooting video scalers in eqiad for kernel update [07:42:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:49] !log restart of relforge for kernel upgrade [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:59] !log powercycling mw1259, stuck in reboot [07:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:00] volans: about tox upgrade, would you mind filling a new task please ? ( https://phabricator.wikimedia.org/T46443#489926 is from 3 years ago and really obsolete :D ) [08:08:11] hashar: sure! [08:08:36] gotta test out tox with various review. From a quick look yesterday, it seems 2.0 filters/clean environment variables [08:08:42] which might or might not be an issue [08:09:37] 10Operations, 10media-storage, 10User-fgiunchedi: Complete stretch reimage for ms-fe / ms-be fleet - https://phabricator.wikimedia.org/T169601#3403388 (10fgiunchedi) [08:11:39] hashar: done T169602 [08:11:39] T169602: Upgrade tox on CI instances - https://phabricator.wikimedia.org/T169602 [08:12:32] (03PS4) 10Gehel: scap3 - deployment of package requires configuration to already exist [puppet] - 10https://gerrit.wikimedia.org/r/362155 (https://phabricator.wikimedia.org/T169011) [08:12:56] !log powercycling mw1260, stuck in reboot [08:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:46] (03PS1) 10Filippo Giunchedi: install_server: use stretch for ms-be / ms-fe [puppet] - 10https://gerrit.wikimedia.org/r/363132 (https://phabricator.wikimedia.org/T169601) [08:17:28] (03CR) 10Filippo Giunchedi: [C: 032] install_server: use stretch for ms-be / ms-fe [puppet] - 10https://gerrit.wikimedia.org/r/363132 (https://phabricator.wikimedia.org/T169601) (owner: 10Filippo Giunchedi) [08:17:29] (03PS2) 10Filippo Giunchedi: install_server: use stretch for ms-be / ms-fe [puppet] - 10https://gerrit.wikimedia.org/r/363132 (https://phabricator.wikimedia.org/T169601) [08:17:38] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] install_server: use 
stretch for ms-be / ms-fe [puppet] - 10https://gerrit.wikimedia.org/r/363132 (https://phabricator.wikimedia.org/T169601) (owner: 10Filippo Giunchedi) [08:24:06] 10Operations, 10MobileFrontend, 10Traffic, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking): Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3403448 (10phuedx) 05Open>03Resolved a:03phuedx This looks like it's done. Feel free to reopen if it ain't. Obvs. Than... [08:25:20] !log restart redis 6380 (slave) jobqueue instance on rdb1004/2003 to force resync with master [08:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:45] (03PS1) 10Marostegui: s2.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363137 (https://phabricator.wikimedia.org/T153743) [08:27:46] (03PS2) 10Marostegui: s2.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363137 (https://phabricator.wikimedia.org/T153743) [08:29:31] (03CR) 10Marostegui: [C: 032] s2.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363137 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:30:24] (03Merged) 10jenkins-bot: s2.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363137 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:30:43] the idea for rdb1004/2003 is to figure out if I can get the diff between the two AOF before they are rewritten to compress them [08:38:08] (03CR) 10Gehel: [C: 032] scap3 - deployment of package requires configuration to already exist [puppet] - 10https://gerrit.wikimedia.org/r/362155 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [08:38:15] (03PS5) 10Gehel: scap3 - deployment of package requires configuration to already exist [puppet] - 10https://gerrit.wikimedia.org/r/362155 (https://phabricator.wikimedia.org/T169011) [08:39:42] !log rebooting sca2* for kernel update [08:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:40:00] (03PS1) 10Marostegui: redact_sanitarium: Add db1102 to the allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/363138 (https://phabricator.wikimedia.org/T153743) [08:43:34] damn puppet dependency cycles! [08:43:47] (03PS1) 10Gehel: Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/363139 [08:45:05] (03CR) 10Gehel: [C: 032] Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/363139 (owner: 10Gehel) [08:46:09] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Ping?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [08:46:12] (03CR) 10Jcrespo: [C: 031] redact_sanitarium: Add db1102 to the allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/363138 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:46:44] (03PS2) 10Marostegui: redact_sanitarium: Add db1102 to the allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/363138 (https://phabricator.wikimedia.org/T153743) [08:46:55] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:47:51] (03CR) 10Marostegui: [C: 032] redact_sanitarium: Add db1102 to the allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/363138 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:54:10] 10Operations: Default to ext4 instead of ext3 - https://phabricator.wikimedia.org/T169605#3403533 (10fgiunchedi) [08:54:30] !log Run redact_sanitarium on db1102 (sanitarium3) - T153743 [08:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:40] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:01:08] (03PS1) 10Filippo Giunchedi: install_server: ext4 for ms-be non-data filesystems [puppet] - 10https://gerrit.wikimedia.org/r/363142 (https://phabricator.wikimedia.org/T169605) [09:02:30] 10Operations, 10Patch-For-Review: Default to ext4 instead of ext3 - https://phabricator.wikimedia.org/T169605#3403566 (10MoritzMuehlenhoff) I think it's safe to use ext4 everywhere we were using ext3 before, I can't think of a potential downside. [09:04:36] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/363142 (https://phabricator.wikimedia.org/T169605) (owner: 10Filippo Giunchedi) [09:06:51] (03PS2) 10Filippo Giunchedi: install_server: ext4 for ms-be non-data filesystems [puppet] - 10https://gerrit.wikimedia.org/r/363142 (https://phabricator.wikimedia.org/T169605) [09:08:01] (03PS3) 10Jcrespo: mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 [09:10:45] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:13:25] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:14:15] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 74812 bytes in 0.321 second response time [09:15:05] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:18:22] (03PS1) 10Gehel: scap3 - deployment of package requires configuration to already exist [puppet] - 10https://gerrit.wikimedia.org/r/363144 (https://phabricator.wikimedia.org/T169011) [09:20:16] (03CR) 10Gehel: [C: 032] scap3 - deployment of package requires configuration to already exist [puppet] - 10https://gerrit.wikimedia.org/r/363144 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [09:24:49] (03CR) 10Filippo Giunchedi: [C: 032] install_server: ext4 for ms-be non-data filesystems [puppet] - 10https://gerrit.wikimedia.org/r/363142 (https://phabricator.wikimedia.org/T169605) (owner: 10Filippo Giunchedi) [09:24:55] (03PS3) 10Filippo Giunchedi: install_server: ext4 for ms-be non-data filesystems [puppet] - 10https://gerrit.wikimedia.org/r/363142 (https://phabricator.wikimedia.org/T169605) [09:27:31] 10Operations, 10Pybal, 10Traffic, 10User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#3403680 (10ema) >>! In T154759#3403354, @MoritzMuehlenhoff wrote: > If we use rotate in resolv.conf, I guess we'd need to couple it with "options timeout:1", otherwise it wouldn't be ve... 
[09:27:54] !log rebooting thumbor1001/1002 for kernel updates [09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:31] 10Operations, 10media-storage, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3403729 (10fgiunchedi) [09:38:12] !log rebooting restbase2002-restbase2004 for kernel updates [09:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:45] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:48:09] 10Operations, 10media-storage, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3403755 (10fgiunchedi) The template uses `@facts['numa']['device_to_htset'][@numa_iface]` and `numa_iface` is set as below. `numa_networking` seems to be disabled everywhere b... [09:49:25] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.165 and port 9042: Connection refused [09:49:35] PROBLEM - cassandra-a SSL 10.192.16.165:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:50:21] 10Operations, 10media-storage, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3403756 (10fgiunchedi) Facts from ms-be2005 `facter --puppet` (as root) ``` interface_primary => enp5s0f0 numa => {"device_to_node"=>{"enp5s0f0"=>[0], "enp5s0f1"=>[0], "lo"=>... 
[09:50:55] PROBLEM - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.166 and port 9042: Connection refused [09:51:00] ^ fixing downtime [09:51:15] PROBLEM - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:52:05] 10Operations, 10media-storage, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3403776 (10fgiunchedi) [09:54:38] !log Stop replication on db1095 for maintenance - T153743 [09:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:50] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:58:25] RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.165 port 9042 [09:58:52] !log Move labsdb1009 main general replication thread to a named replication thread called db1095 - T153743 [09:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363151 (https://phabricator.wikimedia.org/T168661) [10:09:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363151 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:10:04] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363151 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:10:13] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363151 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:12:37] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1035" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/363152 [10:14:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1035 - T168661 (duration: 02m 50s) [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:52] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:14:57] (03PS3) 10Thiemo Mättig (WMDE): mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [10:15:37] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Is anybody willing to merge this patch with the redirect in place? If not, is anybody willing to merge this patch when we just remove the " [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [10:16:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363152 (owner: 10Marostegui) [10:18:58] (03CR) 10Ladsgroup: [C: 031] "I added it for SWAT for tomorrow, There is no train this (because of 4th of July) so I think we are fine for now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [10:19:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363152 (owner: 10Marostegui) [10:19:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1035" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363152 (owner: 10Marostegui) [10:23:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1035 - T168661 (duration: 02m 49s) [10:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:03] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:26:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363153 [10:27:09] (03CR) 10Filippo Giunchedi: [C: 031] Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) (owner: 10GWicke) [10:27:20] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363153 [10:28:59] (03CR) 10Alexandros Kosiaris: [C: 032] Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) (owner: 10GWicke) [10:29:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363153 (owner: 10Marostegui) [10:29:07] (03PS5) 10Alexandros Kosiaris: Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) (owner: 10GWicke) [10:29:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Set up grafana dashboard monitoring for services [puppet] - 10https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765) (owner: 10GWicke) [10:30:10] (03PS1) 
10Elukey: role::analytics_cluster::hadoop::master: add more monitors to HDFS metrics [puppet] - 10https://gerrit.wikimedia.org/r/363154 (https://phabricator.wikimedia.org/T163908) [10:30:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363153 (owner: 10Marostegui) [10:30:28] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363153 (owner: 10Marostegui) [10:33:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T153743 (duration: 02m 49s) [10:33:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363155 (https://phabricator.wikimedia.org/T153743) [10:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:27] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:34:33] (03PS2) 10Elukey: role::analytics_cluster::hadoop::master: add more monitors to HDFS metrics [puppet] - 10https://gerrit.wikimedia.org/r/363154 (https://phabricator.wikimedia.org/T163908) [10:36:20] (03CR) 10Elukey: [C: 032] role::analytics_cluster::hadoop::master: add more monitors to HDFS metrics [puppet] - 10https://gerrit.wikimedia.org/r/363154 (https://phabricator.wikimedia.org/T163908) (owner: 10Elukey) [10:40:29] !log Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743 [10:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:39] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:45:33] 10Operations, 10media-storage, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3403729 (10akosiaris) Turns out this is related to the `stringify_facts` setting. 
The code assumes that facts are structured and not stringified (and doing so correctly), but... [10:52:02] (03PS1) 10Muehlenhoff: Remove smtp port from ferm config [puppet] - 10https://gerrit.wikimedia.org/r/363164 [10:53:58] !log killing stuck wmf-reimage on puppetmaster1001 for maps-test2001 [10:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:17] !log joal@tin Started deploy [analytics/refinery@12c5f57]: Regular weekly deploy [10:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:28] !log copy wikimedia-lvs-realserver from jessie-wikimedia to stretch-wikimedia [10:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:38] !log rebooting kubernetes workers for kernel update [11:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:00] PROBLEM - Host elastic1018 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:05] !log joal@tin Finished deploy [analytics/refinery@12c5f57]: Regular weekly deploy (duration: 04m 47s) [11:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:09:01] ACKNOWLEDGEMENT - HP RAID on ms-be2024 is CRITICAL: NRPE: Command check_raid_hpssacli not defined nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169619 [11:09:05] 10Operations, 10ops-codfw: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T169619#3403963 (10ops-monitoring-bot) [11:09:12] (03PS9) 10Elukey: role::analytics_cluster::refinery::job::data_drop: drop old druid data [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [11:09:34] godog: nuuu, I guess I need to add to the skip list this error too [11:09:54] elastic1018 died [11:09:57] gehel: ^ [11:10:04] 
volans: odd though the host was downtimed [11:10:38] dcausse: I can check if you want, probably need a powercycle? [11:10:49] (03CR) 10Elukey: [C: 032] role::analytics_cluster::refinery::job::data_drop: drop old druid data [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [11:10:51] elukey: yes, I can't login [11:11:20] dcausse: all right, checking [11:13:54] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [11:15:10] !log powercycle elastic1018, host unreachable [11:15:14] RECOVERY - Host elastic1018 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [11:15:19] dcausse: done! [11:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:33] (03CR) 10Faidon Liambotis: [C: 032] aptrepo: add hp-mcp to stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [11:16:36] elukey: thanks! 
[11:19:57] (03CR) 10Muehlenhoff: [C: 031] aptrepo: add hp-mcp to stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [11:21:28] !log joal@tin Started deploy [analytics/refinery@88cbb9e]: Regular weekly deploy (2) - Bug patch [11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:10] !log joal@tin Finished deploy [analytics/refinery@88cbb9e]: Regular weekly deploy (2) - Bug patch (duration: 03m 38s) [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:59] (03PS1) 10Faidon Liambotis: salt: drop the saltversion fact, use os_version() [puppet] - 10https://gerrit.wikimedia.org/r/363169 [11:33:22] (03PS1) 10Hashar: WMF: add nodes with SSH host key verification [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/363170 (https://phabricator.wikimedia.org/T164543) [11:40:37] (03PS1) 10Marostegui: Revert "db-eqiad.php: Add comments to db1039 status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363171 [11:41:10] (03PS2) 10Marostegui: Revert "db-eqiad.php: Add comments to db1039 status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363171 [11:43:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Add comments to db1039 status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363171 (owner: 10Marostegui) [11:44:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Add comments to db1039 status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363171 (owner: 10Marostegui) [11:45:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Add comments to db1039 status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363171 (owner: 10Marostegui) [11:47:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove comments from db1039 status - T166208 (duration: 02m 50s) [11:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:03] 
T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [11:48:54] (03PS1) 10Hashar: 0.1.1-wmf8: add nodes with SSH host key verification [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/363174 (https://phabricator.wikimedia.org/T164543) [11:49:34] 10Operations, 10Pybal, 10Traffic, 10User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#3404022 (10ema) @mark's comment on a related issue (T103921#1417034) made me go dig a bit deeper into our IdleConnection monitor implementation. The version currently in prod [[https://... [11:50:13] ema: hah! [11:50:58] (03CR) 10Hashar: "Patch is incorporated in the Debian packages via https://gerrit.wikimedia.org/r/#/c/363174/" [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/363170 (https://phabricator.wikimedia.org/T164543) (owner: 10Hashar) [11:51:05] paravoid: heh! [11:51:17] (03CR) 10Hashar: "Patch is saved in git branch patch-queue/debian https://gerrit.wikimedia.org/r/#/c/363170/" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/363174 (https://phabricator.wikimedia.org/T164543) (owner: 10Hashar) [11:52:46] paravoid: now the question is whether we want to cherry-pick that fix in 1.13 as a bugfix release or if we should consider it a new feature and prepare 1.14 :) [12:00:32] 10Operations, 10Performance-Team, 10Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3404039 (10Peter) 05Open>03Resolved I've increased the alert to only alert on 40% or more change, let us keep it like that for a while (before we used 20%). 
[12:11:17] (03CR) 10Ayounsi: [C: 031] Remove smtp port from ferm config [puppet] - 10https://gerrit.wikimedia.org/r/363164 (owner: 10Muehlenhoff) [12:47:26] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3404100 (10fgiunchedi) [12:47:48] 10Operations, 10netops: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3404104 (10ayounsi) [12:48:58] PROBLEM - Host restbase2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:28] RECOVERY - Host restbase2004 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [12:51:18] (03PS1) 10Urbanecm: Upload logos for maiwikimedia, add them to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363176 (https://phabricator.wikimedia.org/T168782) [12:52:28] PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.139 and port 9042: Connection refused [12:52:29] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [12:52:38] PROBLEM - cassandra-c SSL 10.192.32.139:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:52:38] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:53:17] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This would create a duplicate definition causing issues to production while not really solving the problem described in T169164" [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [12:53:28] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.139 port 9042 [12:53:28] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.137 port 9042 
[12:53:38] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 70 days) [12:53:38] RECOVERY - cassandra-c SSL 10.192.32.139:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-c valid until 2017-09-12 15:35:28 +0000 (expires in 70 days) [12:54:28] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, @ema, @bblack care to take a look at the preview ?" [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [12:55:44] 10Operations, 10netops: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3404140 (10MoritzMuehlenhoff) Since the tool (after an initially vetted run) only shows difference to the previous run I think it should simply be sent to the standard root mails. (We probably need a little tweak so that it do... [12:56:09] (03PS1) 10Filippo Giunchedi: install_server: enable structured facts at provisioning [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) [12:57:21] akosiaris volans ^ [12:57:35] * volans looking [12:57:50] nope, missing 'in-target' [12:57:53] will amend [12:57:59] (03PS6) 10Urbanecm: Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360450 (https://phabricator.wikimedia.org/T168422) [12:58:02] (03PS2) 10Urbanecm: Add maiwikimedia to DNS [dns] - 10https://gerrit.wikimedia.org/r/361295 (https://phabricator.wikimedia.org/T168782) [12:58:06] (03PS2) 10Urbanecm: Add maiwikimedia to Apache conf [puppet] - 10https://gerrit.wikimedia.org/r/361296 (https://phabricator.wikimedia.org/T168782) [12:58:37] not sure if the full path is needed too godog [12:58:40] (03PS2) 10Filippo Giunchedi: install_server: enable structured facts at provisioning [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) [12:58:51] volans: yeah I'm not sure either, added it just in case [13:00:51] 10Operations, 
10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3404157 (10fgiunchedi) @aaron yeah it will need some love, in the meantime I've patched in https support so swiftrepl will DTRT. Looks like this completes the... [13:01:02] * volans love puppet config {set,print}... {set,get} was not fancy enough? :D [13:01:19] (03PS7) 10Urbanecm: Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) [13:01:48] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) (owner: 10Filippo Giunchedi) [13:03:40] (03PS3) 10Urbanecm: Initial configuration for Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362168 (https://phabricator.wikimedia.org/T168518) [13:03:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] Tweak ores::web config file user. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/362097 (owner: 10Awight) [13:04:28] (03CR) 10Alexandros Kosiaris: [C: 031] install_server: enable structured facts at provisioning [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) (owner: 10Filippo Giunchedi) [13:04:34] (03CR) 10Ema: [C: 031] "Nice, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [13:04:46] volans: not that'd have been consistent! 
[13:04:54] (03PS3) 10Filippo Giunchedi: install_server: enable structured facts at provisioning [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) [13:05:47] I was happy to find out about config set though, I was already thinking on how to do the same without it [13:06:11] yeah I noticed, I was already thinking awk :D [13:06:29] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2005.codfw.wmnet [13:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:15] (03CR) 10Filippo Giunchedi: [C: 032] install_server: enable structured facts at provisioning [puppet] - 10https://gerrit.wikimedia.org/r/363177 (https://phabricator.wikimedia.org/T169612) (owner: 10Filippo Giunchedi) [13:17:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comments inline. Overall data is 16GB (per week) which is not much, we should not have a problem." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [13:18:49] (03PS2) 10Hashar: 0.1.1-wmf8: add nodes with SSH host key verification [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/363174 (https://phabricator.wikimedia.org/T164543) [13:21:28] (03CR) 10Hashar: [C: 032] 0.1.1-wmf8: add nodes with SSH host key verification [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/363174 (https://phabricator.wikimedia.org/T164543) (owner: 10Hashar) [13:21:39] (03CR) 10Hashar: [C: 032] WMF: add nodes with SSH host key verification [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/363170 (https://phabricator.wikimedia.org/T164543) (owner: 10Hashar) [13:26:49] (03CR) 10Filippo Giunchedi: [C: 032] "Merging to test a stretch reprovision and wmf-auto-reimage" [puppet] - 10https://gerrit.wikimedia.org/r/363169 (owner: 10Faidon Liambotis) [13:26:58] (03PS2) 10Filippo Giunchedi: salt: drop the saltversion fact, use os_version() [puppet] - 10https://gerrit.wikimedia.org/r/363169 
(owner: 10Faidon Liambotis) [13:30:00] (03PS1) 10Aude: Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 [13:30:15] (03CR) 10Paladox: "> This would create a duplicate definition causing issues to" [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [13:30:31] (03PS2) 10Aude: Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 [13:35:20] (03CR) 10Paladox: "> since we don't really use drafts, this _seems_ right" [puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [13:36:33] (03CR) 10Alexandros Kosiaris: [C: 04-2] "My guess would be $deployment is set to something else than scap3 through Hiera." [puppet] - 10https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: 10Paladox) [13:38:11] (03PS1) 10Ema: Add IPv6 support to all monitors [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/363182 [13:38:35] (03CR) 10Alexandros Kosiaris: "Could you please explain why ?" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [13:39:08] (03CR) 10WMDE-leszek: [C: 031] Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 (owner: 10Aude) [13:39:13] (03CR) 10Paladox: "> Could you please explain why ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [13:40:34] (03PS2) 10Faidon Liambotis: Revert "base: cleanup unneeded ipmi packages/checks" [puppet] - 10https://gerrit.wikimedia.org/r/362980 [13:40:52] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Revert "base: cleanup unneeded ipmi packages/checks" [puppet] - 10https://gerrit.wikimedia.org/r/362980 (owner: 10Faidon Liambotis) [13:48:31] 10Operations, 10Gerrit: rename user gerrit2 to gerrit - https://phabricator.wikimedia.org/T169634#3404360 (10Paladox) [13:54:39] (03PS1) 10Alexandros Kosiaris: WIP monitoring: provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [13:55:57] (03CR) 10jerkins-bot: [V: 04-1] WIP monitoring: provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [13:56:27] (03PS1) 10Ema: Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] - 10https://gerrit.wikimedia.org/r/363188 (https://phabricator.wikimedia.org/T82747) [14:03:31] 10Operations, 10netops: zombie rstp configuration - https://phabricator.wikimedia.org/T169637#3404423 (10ayounsi) [14:05:32] (03PS1) 10Filippo Giunchedi: salt: let wmf_auto_reimage wait after the first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/363189 (https://phabricator.wikimedia.org/T169601) [14:07:03] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/363189 (https://phabricator.wikimedia.org/T169601) (owner: 10Filippo Giunchedi) [14:09:34] 10Operations, 10Operations-Software-Development: Migrate debdeploy to cumin - https://phabricator.wikimedia.org/T164817#3404458 (10MoritzMuehlenhoff) [14:09:50] (03PS2) 10Filippo Giunchedi: salt: let wmf_auto_reimage wait after the first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/363189 (https://phabricator.wikimedia.org/T169601) [14:11:25] (03CR) 10Filippo Giunchedi: [C: 032] salt: let wmf_auto_reimage wait after the first puppet run [puppet] - 
10https://gerrit.wikimedia.org/r/363189 (https://phabricator.wikimedia.org/T169601) (owner: 10Filippo Giunchedi) [14:15:43] !log reset db2038's iLO [14:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:54] (03Restored) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 (owner: 10Hashar) [14:17:39] (03PS3) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 [14:19:11] (03Restored) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 (owner: 10Hashar) [14:19:20] (03PS3) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 (https://phabricator.wikimedia.org/T159254) [14:23:59] PROBLEM - salt-minion processes on ms-fe2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:24:29] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:24:29] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [14:26:20] (03CR) 10Giuseppe Lavagetto: [C: 031] Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] - 10https://gerrit.wikimedia.org/r/363188 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema) [14:30:47] (03CR) 10Ema: [C: 032] Add IPv6 support to all monitors [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/363182 (owner: 10Ema) [14:31:04] ! 
[14:31:13] (03CR) 10Ema: [C: 032] Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] - 10https://gerrit.wikimedia.org/r/363188 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema) [14:31:22] !log copy nginx from jessie-wikimedia to stretch-wikimedia [14:31:24] (03CR) 10Ema: [V: 032 C: 032] Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] - 10https://gerrit.wikimedia.org/r/363188 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema) [14:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:50] (03PS1) 10Ema: Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/363193 (https://phabricator.wikimedia.org/T82747) [14:32:41] (03CR) 10Paladox: "It was removed in https://github.com/GerritCodeReview/gerrit/commit/c4a90512d540041be5c256ce6aba1dae44b7aecf#diff-bbc1adecda45cb2e6ab44d16" [puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [14:33:14] godog: we'll need to rebuild nginx for stretch-wikimedia, the jessie package is linked against our custom OpenSSL 1.1 package for jessie [14:34:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3404554 (10ayounsi) [14:34:55] moritzm: sigh, I suspected it wasn't going to be so simple heh [14:35:08] I'll build it on copper [14:36:10] moritzm: do you know from what repo? 
[14:36:45] let me check [14:37:54] I think we can safely set operations/debs/nginx as deprecated, last commit was 2013 [14:37:54] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3404579 (10ayounsi) [14:38:45] yeah, operations/software/nginx isn't used any longer, for the last uploads only the package was updated, so simply rebuild the existing source package for stretch [14:39:10] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage gerrit2001 and cobalt as stretch - https://phabricator.wikimedia.org/T168562#3404595 (10Paladox) [14:39:30] ack [14:39:57] if you do that, please change the Build dependency from "libssl11-dev" to "libssl11-dev | libssl-dev", so that we can rebuild the same source on jessie and stretch [14:41:31] (03CR) 10Ema: [V: 032 C: 032] Add IPv6 support to all monitors, improve IdleConnection logging [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/363193 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema) [14:44:09] moritzm: yup, will do, I'll ping brandon tomorrow to check where/if we keep nginx somewhere in git [14:44:17] (03PS2) 10Alexandros Kosiaris: WIP monitoring: provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [14:44:19] (03PS1) 10Alexandros Kosiaris: Bump rspec-puppet to 2.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/363194 [14:49:33] (03CR) 10jerkins-bot: [V: 04-1] WIP monitoring: provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [14:50:19] (03PS1) 10Jcrespo: [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:51:24] godog: I'm pretty sure we don't any longer, IIRC the git repo was only used temporarily [14:52:17] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 
(https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:52:20] RECOVERY - salt-minion processes on ms-fe2005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:52:25] ugh, if we don't we should I think, at least to track our patches [14:52:39] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational [14:52:49] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:52:55] godog: actually, it's still used, last commit is from a few months ago: https://github.com/wikimedia/operations-software-nginx/commit/df93f9585393f4294423ce23c1015942b7e5b967 [14:53:24] seem to have confused it with another package [14:54:33] moritzm: ah nevermind I was looking at *debs* not *software* [14:54:55] me too initially :-) [14:55:19] PROBLEM - salt-minion processes on ms-fe2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:56:11] (03PS1) 10Faidon Liambotis: ipmi: add ipmitool support to ipmi_lan [puppet] - 10https://gerrit.wikimedia.org/r/363197 [14:57:00] !log pybal 1.13.7 uploaded to apt.w.o, testing it on pybal-test2001 T82747 T154759 [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:10] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [14:57:10] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [14:58:12] (03PS2) 10Jcrespo: [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:58:26] (03CR) 10Faidon Liambotis: [C: 032] ipmi: add ipmitool support to ipmi_lan [puppet] - 10https://gerrit.wikimedia.org/r/363197 (owner: 10Faidon Liambotis) [14:58:34] (03PS2) 10Faidon Liambotis: ipmi: add ipmitool support to ipmi_lan [puppet] - 
10https://gerrit.wikimedia.org/r/363197 [14:58:39] (03CR) 10Faidon Liambotis: [V: 032 C: 032] ipmi: add ipmitool support to ipmi_lan [puppet] - 10https://gerrit.wikimedia.org/r/363197 (owner: 10Faidon Liambotis) [14:59:05] (03PS1) 10Filippo Giunchedi: Build-Depend on libssl11-dev | libssl-dev for jessie and stretch compat [software/nginx] - 10https://gerrit.wikimedia.org/r/363199 [14:59:55] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:00:09] 10Operations, 10netops: zombie rstp configuration - https://phabricator.wikimedia.org/T169637#3404706 (10faidon) I know nothing about that, likely a stock config that wasn't cleaned up and left cruft behind. [15:00:56] (03CR) 10Muehlenhoff: [C: 031] Build-Depend on libssl11-dev | libssl-dev for jessie and stretch compat [software/nginx] - 10https://gerrit.wikimedia.org/r/363199 (owner: 10Filippo Giunchedi) [15:01:06] 10Operations, 10Patch-For-Review: Default to ext4 instead of ext3 - https://phabricator.wikimedia.org/T169605#3404713 (10fgiunchedi) [15:02:42] !log set operations/debs/nginx as hidden and update description [15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] (03PS3) 10Jcrespo: [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [15:05:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:06:29] !log mobrovac@tin Started deploy [citoid/deploy@9d22567]: Fallback to crossRef (T165105) and use MarcXML (T165105) [15:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] T165105: Wiley requests for DOI and some other publishers don't 
work in production - https://phabricator.wikimedia.org/T165105 [15:09:21] !log mobrovac@tin Finished deploy [citoid/deploy@9d22567]: Fallback to crossRef (T165105) and use MarcXML (T165105) (duration: 02m 52s) [15:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:37] 10Operations, 10ops-codfw, 10ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3404751 (10faidon) In addition to these, the following return 192.168.0.0/16 addresses for either their address or the gateway: - analytics1047.eqiad.wmnet - analytics1061.eqiad.wmnet - auth2001.... [15:09:48] (03PS4) 10Jcrespo: [WIP] Support multiple instances on the mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [15:13:07] 10Operations, 10ops-codfw, 10ops-eqiad: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3404760 (10faidon) [15:14:25] (03PS1) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) [15:20:30] (03PS5) 10Jcrespo: [WIP]mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [15:22:27] RECOVERY - salt-minion processes on ms-fe2005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:23:04] (03CR) 10Aude: Enable WikibaseQualityConstraints statements (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [15:27:09] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Patch-For-Review, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3384741 (10aude) @thiemowmde do we have an https link to point to for this? 
[15:29:47] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:29:58] (03CR) 10Aude: "suggest just to remove the alias." [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [15:30:59] (03PS6) 10Jcrespo: mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [15:33:46] (03PS1) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) [15:35:44] 10Puppet, 10Cloud-Services: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612#3404848 (10Multichill) @scfc You can probably start with this one? Or is WDQ still blocking you in some way? [15:47:13] (03PS1) 10Addshore: DNM remove wgRevisionSliderAlternateSlider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 [15:47:55] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10Tgr) ORES is considering useragent-based connection limiting (we ar... 
[15:49:44] (03PS1) 10Addshore: DNM Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 [15:50:08] 10Operations, 10Goal: Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking) - https://phabricator.wikimedia.org/T169658#3404946 (10jcrespo) [15:52:05] 10Operations, 10Goal: Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking) - https://phabricator.wikimedia.org/T169658#3404958 (10jcrespo) [15:52:35] (03CR) 10Addshore: [C: 04-2] DNM remove wgRevisionSliderAlternateSlider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 (owner: 10Addshore) [15:52:38] (03CR) 10Addshore: [C: 04-2] DNM Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 (owner: 10Addshore) [15:52:56] (03CR) 10Giuseppe Lavagetto: [C: 031] TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 (owner: 10Volans) [15:53:51] (03PS2) 10Addshore: WMDE Summer campaign - Add logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631) [15:53:58] (03PS2) 10Volans: TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 [15:54:08] (03CR) 10jerkins-bot: [V: 04-1] TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 (owner: 10Volans) [15:56:31] how on earth it can have conflicts I dunno, but ok jenkins, I will resolve them [15:59:52] (03CR) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints statements (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [16:01:11] (03PS2) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) [16:02:08] (03CR) 10Hashar: [C: 031] "Looks 
all fine locally." [puppet] - 10https://gerrit.wikimedia.org/r/363194 (owner: 10Alexandros Kosiaris) [16:06:34] (03CR) 10Giuseppe Lavagetto: [C: 031] nutcracker: validate new config file [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) (owner: 10Filippo Giunchedi) [16:10:44] !log rebooting radium for kernel update [16:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 9042: Connection refused [16:14:45] PROBLEM - Check systemd state on maps-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:56] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [16:14:56] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting [16:15:05] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 28 minutes ago with 15 failures. 
Failed resources (up to 3 shown): Service[postgresql@9.4-main],Exec[create_user-kartotherian],Exec[create_user-monitoring@maps-test2002],Exec[create_user-tileratorui] [16:15:05] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:15:32] ^those should have been silenced, checking... [16:21:44] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3405073 (10Dzahn) [16:24:39] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3405078 (10Dzahn) One at a time please, first gerrit2001 only i suggest. [16:24:55] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2002798 [16:34:00] (03PS1) 10Muehlenhoff: Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 [16:34:58] (03PS2) 10Muehlenhoff: Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 [16:41:16] (03PS1) 10Giuseppe Lavagetto: Re-add support for defining threads from CI/cli [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363214 [16:41:18] (03PS1) 10Giuseppe Lavagetto: Move HostWorker to a dedicated class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363215 [16:41:20] (03PS1) 10Giuseppe Lavagetto: Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 [16:41:22] (03PS1) 10Giuseppe Lavagetto: [WiP] Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [16:41:56] (03CR) 10jerkins-bot: [V: 04-1] Move HostWorker to a dedicated class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363215 (owner: 10Giuseppe Lavagetto) [16:41:59] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Generalize state management, allow multiple run modes 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 (owner: 10Giuseppe Lavagetto) [16:42:03] (03PS1) 10Jcrespo: ep_courses.course_token should not be public, but filtered [puppet] - 10https://gerrit.wikimedia.org/r/363218 (https://phabricator.wikimedia.org/T169661) [16:42:11] <_joe_> I know of both :) [16:43:19] (03PS2) 10Jcrespo: ep_courses.course_token should not be public, but filtered [puppet] - 10https://gerrit.wikimedia.org/r/363218 (https://phabricator.wikimedia.org/T169661) [16:46:14] (03CR) 10Jcrespo: [C: 032] ep_courses.course_token should not be public, but filtered [puppet] - 10https://gerrit.wikimedia.org/r/363218 (https://phabricator.wikimedia.org/T169661) (owner: 10Jcrespo) [16:49:35] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:54:43] !log dropping ukwikimedia from several labsdbhosts [16:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:45] (03CR) 10Marostegui: "I haven't reviewed the dbstore_multiinstance role yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:05:15] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active [17:05:35] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [17:05:55] RECOVERY - Check systemd state on maps-test2001 is OK: OK - running: The system is fully operational [18:20:47] (03Abandoned) 10Mforns: [WIP] Fix timestamp infinite loop in EL purging script (2) [puppet] - 10https://gerrit.wikimedia.org/r/362101 (owner: 10Mforns) [18:20:59] (03Abandoned) 10Mforns: [WIP] Fix timestamp infinite loop in EL purging script (1) [puppet] - 10https://gerrit.wikimedia.org/r/362103 (owner: 10Mforns) [19:07:29] (03PS12) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [19:25:25] (03PS13) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [19:48:16] (03PS1) 10Brian Wolff: Redact ep_courses.course_token [puppet] - 10https://gerrit.wikimedia.org/r/363230 (https://phabricator.wikimedia.org/T169661) [20:11:21] (03PS14) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [20:16:50] (03PS15) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [20:18:36] (03PS16) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [20:39:39] (03PS1) 10MaxSem: Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) [20:40:45] PROBLEM - puppet 
last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:41:55] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:45:57] what? [20:46:21] * volans checking [20:48:04] moritzm: did you upgrade anything 6h ago on stat1003 and snapshot100[1,5,7]? [20:48:35] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:48:41] their loadavg is skyroketing, but it's a fake load [20:48:57] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:49:03] s/skyroketing/sky-rocketing/ [20:51:43] stat1003 is complaining about nfs, that explains the "fake load", checking the others [20:52:46] ok, it seems to be the NFS on dataset1001.wikimedia.org, checking [20:54:01] [Jul 4 15:07] INFO: task nfsd:1764 blocked for more than 120 seconds. [20:54:15] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:54:43] (03CR) 10Gergő Tisza: [C: 031] "A good way to handle WP0 Phab abuse IMO. I don't know enough about varnish and WP0 to comment on the implementation." [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [20:55:04] https://commons.wikimedia.org/wiki/File:444_album_cover.png any idea why the thumbnail doesn't display? [20:55:13] should I open a report? [20:57:14] volans: no, didn't update those [20:57:45] moritzm: sorry for the ping, I figured out later that it is the NFS on dataset1001 [20:57:49] do you know anything about it? [20:57:55] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [20:58:34] volans: no idea, might be caused by a current dump run which is load intensive? 
[20:58:53] (03CR) 10Ladsgroup: [C: 031] "I confirm it's her :)" [puppet] - 10https://gerrit.wikimedia.org/r/363180 (owner: 10Aude) [20:59:04] I'll check running processes, for now the kernel logged [20:59:05] INFO: task nfsd:1766 blocked for more than 120 seconds [21:02:32] yannf: It's giving a 429 error [21:02:37] which I don't even know what that means [21:02:41] so probably yes [21:03:02] 429 = too many requests [21:03:23] I only get a blank page [21:04:09] yannf: The error is in the http status code, it's a blank page with that error [21:05:32] I would guess maybe thumbor has some sort of rate limiting, but doesn't send out a human readable error code [21:05:41] I don't actually know if thumbor is enabled yet [21:05:45] gilles: ---^ [21:06:57] https://phabricator.wikimedia.org/T169678 done [21:13:28] bawolff: 429 is also returned by the rate limiter in varnish, but IIRC only for some UA [21:14:30] The http error text was also kind of weird, something like "No phrase available" (I since have closed the window) [21:14:49] could very well be varnish, I'm not really familiar with the new thumbnail pipeline [21:15:07] sorry busy with another issue atm [21:15:21] "No reason phrase" [21:16:15] I tried: download, open in Gimp, save with a different compression rate, reupload -> idem [21:17:14] reverted, now it works!???! [21:17:26] 10Operations, 10Commons, 10Thumbor, 10Traffic: PNG thumbnail gives a 429 error - https://phabricator.wikimedia.org/T169678#3405600 (10Bawolff) https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/444_album_cover.png/600px-444_album_cover.png gives: ``` HTTP/2.0 429 No Reason Phrase Date: Tue, 04 Jul... 
[21:19:53] reverting could very well change the ratelimiting bucket, or clear a cached rate limit response or something [21:20:38] 10Operations, 10Commons, 10Thumbor, 10Traffic: PNG thumbnail gives a 429 error - https://phabricator.wikimedia.org/T169678#3405604 (10Bawolff) [Even if this particular image is subsequently fixed, it should be investigated why there is no friendly html error message returned] [21:23:05] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3405605 (10Dereckson) [21:26:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:36:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:38:20] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3405612 (10Volans) [21:38:38] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3405624 (10Volans) p:05Triage>03High [21:39:39] ACKNOWLEDGEMENT - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 [21:39:39] ACKNOWLEDGEMENT - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 [21:39:39] ACKNOWLEDGEMENT - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 [21:39:39] ACKNOWLEDGEMENT - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 
[21:39:39] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 [21:39:39] ACKNOWLEDGEMENT - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago Volans dataset1001 NFS overloaded https://phabricator.wikimedia.org/T169680 [21:40:50] !log ACK'ed puppet not running on stat100[2-3],snapshot100[1,5-7] due to NFS overloaded on dataset1001 - T169680 [21:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:02] T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680 [21:41:34] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3405612 (10Paladox) It shows that it using 4.9.0-0.bpo.2-amd64 labs found problems with that kernel. downgrading to the default one in jessie fixed it. (a... [21:44:42] volans labs had that problem with nfs when they tried upgrading the kernel 4.9 last week. [21:45:22] volans https://phabricator.wikimedia.org/T169289 [21:46:40] paladox: thanks, I know, but this was not upgraded that recently [21:47:00] Oh. ok [21:47:16] so could (should?) be unrelated [22:54:55] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 220502 [23:06:00] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3405677 (10ArielGlenn) The dump processes were all hung for hours. I did nfs-kernel-service restart on dataset1001 and shot the datasets processes on snap... [23:23:25] (03Abandoned) 10Legoktm: Enable $wgAllowSiteCSSOnRestrictedPages on arbcom-de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357022 (https://phabricator.wikimedia.org/T166947) (owner: 10Framawiki)