[00:01:16] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:26] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:36] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:04:06] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[00:04:16] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[00:04:26] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[00:05:05] ACKNOWLEDGEMENT - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdf] daniel_zahn https://phabricator.wikimedia.org/T166021
[00:05:33] (CR) Dzahn: [C: +2] graphite: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353364 (owner: Dzahn)
[00:07:11] (CR) Dzahn: "no-op" [puppet] - https://gerrit.wikimedia.org/r/353364 (owner: Dzahn)
[00:09:10] (PS1) Aaron Schulz: Switch Swift URLs to HTTPs [mediawiki-config] - https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616)
[00:18:30] (PS4) Dzahn: contint: role/profile conversion [puppet] - https://gerrit.wikimedia.org/r/355156
[00:24:54] (CR) Dzahn: [C: -1] "more questions, are we really going to use hieradata/profile/ ?" [puppet] - https://gerrit.wikimedia.org/r/355156 (owner: Dzahn)
[00:40:58] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3285335 (faidon) a: faidon→Cmjohnson I don't see a process for refetching that, so we'll likely need to go and someone at the RIPE NCC. That said, if it was the board that was problematic, aren't we reusing...
[01:02:36] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[01:02:36] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[01:04:26] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[01:04:26] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[01:31:26] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[01:32:26] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74145 bytes in 0.675 second response time
[02:23:01] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 25s)
[02:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 23 02:29:17 UTC 2017 (duration 6m 16s)
[02:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:32] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3285362 (Cmjohnson) We did reuse the storage card. I believe the mac address is the only change. @Faidon it's connected via console.
[04:09:56] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3826.90 Read Requests/Sec=1578.30 Write Requests/Sec=4.80 KBytes Read/Sec=35127.60 KBytes_Written/Sec=568.40
[04:19:56] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=9.20 Read Requests/Sec=0.20 Write Requests/Sec=0.60 KBytes Read/Sec=1.20 KBytes_Written/Sec=6.40
[04:42:16] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 60972 MB (12% inode=99%)
[05:27:36] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:26] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:36] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[05:28:36] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:36] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:29:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[05:30:36] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[05:31:16] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[05:40:16] RECOVERY - Disk space on elastic1028 is OK: DISK OK
[06:09:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185
[06:09:10] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185
[06:16:08] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:19:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:19:21] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:20:10] !log Deploy alter table on s2 eqiad master db1054 - T162611
[06:20:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1021 - T162611 (duration: 00m 38s)
[06:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:20] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:59] (PS2) Elukey: Remove any reference of mc1001->mc1018 for decom [puppet] - https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341)
[06:29:05] !log Deploy alter table on s7.frwiktionary db2040 and db1034 - T165743
[06:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:14] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[06:30:22] (Abandoned) Muehlenhoff: Make HHVM depend on nutcracker service [puppet] - https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: Muehlenhoff)
[06:49:07] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3285429 (elukey)
[06:50:35] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3230583 (elukey) @Cmjohnson: The hosts are ready for the non interruptible steps, including https://gerrit.wikimedia.org/r/354453, so I haven't merge...
[06:50:48] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3285432 (elukey) a: Cmjohnson
[07:07:02] !log Rename gather_list gather_list_flag gather_list_item on db1078 db1094 and db1089 - T166097
[07:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:09] T166097: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097
[07:13:06] !log Deploy schema change on ruwiki.ores_classification directly on codfw master (db2028) - T164530
[07:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:14] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[07:17:14] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3285476 (Gehel) We can keep elastic2020 down for a few more weeks if needed. The cluster is able to sustain the current load with -1 node.
[07:25:04] !log installing openjdk security updates on maps and wdqs clusters
[07:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:26] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:44:46] PROBLEM - DPKG on kafka1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:44:46] PROBLEM - DPKG on kafka1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:06] PROBLEM - DPKG on kafka2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:07] PROBLEM - DPKG on kafka1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:26] PROBLEM - DPKG on kafka2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:27] PROBLEM - DPKG on kafka2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:48:16] !log addshore@terbium:~$ ~/mymwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php et+wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407
[07:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:24] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[07:50:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:51:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:51:38] * elukey stares to addshore for the 5xx
[07:51:41] :D
[07:51:43] checking them
[07:51:45] O_o
[07:51:53] should not be me
[07:52:17] stopped my script anyway
[07:52:25] ahhahah I know I know
[07:52:34] from https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X it seems a temp spike
[07:53:42] (CR) Hashar: HHVM: Fix puppet on trusty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[07:54:26] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:54:30] Well, could be totally unrelated, but looking at https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-48h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1031 regarding T164407 i start to see things odd happening again
[07:54:31] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[07:54:33] seems cp3033 related
[07:55:12] (PS10) Hashar: contint: skip hhvm experimental pin on Trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[07:55:28] a lot of errors in the spike end up with cp3033 int in x-cache
[07:56:06] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[07:57:06] RECOVERY - DPKG on kafka2001 is OK: All packages OK
[07:57:26] RECOVERY - DPKG on kafka2002 is OK: All packages OK
[07:57:41] yeah there is a big fetch failed event https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3033
[07:58:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:58:22] ema: ---^
[07:58:32] afaics seems cp3033 related
[07:58:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:58:56] Varnish backends main threads are really high, the rest looks good
[08:00:27] !log the last script I started is now stopped
[08:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:06] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[08:08:46] RECOVERY - DPKG on kafka1001 is OK: All packages OK
[08:09:06] RECOVERY - DPKG on kafka1002 is OK: All packages OK
[08:09:47] RECOVERY - DPKG on kafka1003 is OK: All packages OK
[08:10:26] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[08:13:00] !log Force WB as a default policy on db1031 because of degraded BBU
[08:13:06] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:14:46] PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:15:39] !log apply manually https://gerrit.wikimedia.org/r/#/c/351854/2/wmf-config/jobqueue.php (persistent connections between hhvm and redis) to mw1161 as production test
[08:15:46] PROBLEM - DPKG on stat1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:45] ^stat1004/contint1001 are expected
[08:24:26] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[initramfs-tools],Package[openjdk-7-jdk]
[08:26:34] (CR) Hashar: "We can most probably merge this one. Note I have removed HHVM from the CI Trusty images via https://gerrit.wikimedia.org/r/#/c/355187/" [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[08:27:40] Operations, ops-eqiad, DBA: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (Marostegui)
[08:28:46] RECOVERY - DPKG on stat1004 is OK: All packages OK
[08:28:56] no changes to TIME_WAITs for mw1161
[08:29:09] (manually changed /srv/mediawiki/wmf-config/jobqueue.php)
[08:29:46] RECOVERY - DPKG on contint1001 is OK: All packages OK
[08:30:26] RECOVERY - DPKG on kafka2003 is OK: All packages OK
[08:35:58] Operations, ops-eqiad, DBA: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285604 (Marostegui) After a long while the BBU shows `Optimal` again, so looks like the manual relearn worked (the same way it did on db1048 - T160731#3109104 ) Setting the policy back to its default...
[08:47:57] (CR) Multichill: [C: +1] "Yes please" [mediawiki-config] - https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: Hoo man)
[08:53:26] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:53:46] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[08:54:06] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[08:54:36] mutante: no profiles should NOT have a system::role. Only real roles, so yes profiles should lose them when being converted.
[08:57:16] akosiaris: (curious) - so roles should theoretically only contain system::role and profile includes ?
[09:05:46] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:11:04] (PS4) Volans: Puppet compiler: automatically sync from all masters [puppet] - https://gerrit.wikimedia.org/r/354105 (https://phabricator.wikimedia.org/T165583)
[09:13:34] (CR) Volans: [C: +2] Puppet compiler: automatically sync from all masters [puppet] - https://gerrit.wikimedia.org/r/354105 (https://phabricator.wikimedia.org/T165583) (owner: Volans)
[09:13:39] Operations, Traffic, netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3285703 (ayounsi) a: ayounsi
[09:15:39] !log reverted manual hack on mw1161 with scap pull
[09:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:00] !log Restarting Jenkins on contint1001
[09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:38] elukey: not theoretically, in practice (after the migration is done)
[09:19:46] but yes
[09:19:52] super thanks
[09:20:20] elukey: https://wikitech.wikimedia.org/wiki/Puppet_coding#Roles
[09:22:05] akosiaris: for some reason I didn't remember 1. :)
[09:22:14] but it makes sense
[09:24:31] !log restarting cassandra on restbase1007, restbase1009, restbase1012 to pick up Java security updates
[09:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:52] Operations, Operations-Software-Development, Patch-For-Review: Puppet compiler: sync facts from all workers - https://phabricator.wikimedia.org/T165583#3285745 (Volans) Open→Resolved Documentation updated on https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation
[09:46:18] !log addshore@terbium:/srv/mediawiki/php-1.30.0-wmf.1$ mwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407
[09:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:26] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[09:49:24] !log swift eqiad-prod: ms-be1028/ms-be1039 object weight 3500 - T160640
[09:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:33] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640
[09:50:35] (PS6) Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451)
[09:52:17] (PS1) Jcrespo: raid-check: Return critical in WriteThough mode for megacli [puppet] - https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[09:52:39] (PS2) Filippo Giunchedi: logstash: move 'hostname' to 'host' for webrequest [puppet] - https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451)
[09:54:16] (PS2) Jcrespo: raid-check: Return critical in WriteThough mode for megacli [puppet] - https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[09:55:11] (CR) Filippo Giunchedi: [C: +2] logstash: build http_request from webrequest fields [puppet] - https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[09:55:15] (CR) Filippo Giunchedi: [C: +2] logstash: move 'hostname' to 'host' for webrequest [puppet] - https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[09:56:08] !log restarting cassandra on restbase1013, restbase1014, restbase1015, restbase1017 to pick up Java security updates
[09:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:09] Operations, ops-eqiad, DBA, Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285788 (Marostegui) And this happened again: ``` root@db1031:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 3830 mV Current: -685 mA Temperatur...
[10:06:37] Operations, ops-eqiad, DBA, Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285791 (Marostegui)
[10:08:16] (PS1) Giuseppe Lavagetto: Debianization fixups [calico-containers] - https://gerrit.wikimedia.org/r/355191
[10:08:39] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Debianization fixups [calico-containers] - https://gerrit.wikimedia.org/r/355191 (owner: Giuseppe Lavagetto)
[10:10:45] (PS1) Giuseppe Lavagetto: profile::calico::builder: fix branch to check out [puppet] - https://gerrit.wikimedia.org/r/355192
[10:12:09] (CR) Giuseppe Lavagetto: [C: +2] profile::calico::builder: fix branch to check out [puppet] - https://gerrit.wikimedia.org/r/355192 (owner: Giuseppe Lavagetto)
[10:14:42] !log Run pt-table-checksum on s7.frwiktionary - T165743
[10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:52] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[10:15:29] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3285795 (akosiaris) Per https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=7680e3d95eee2aa98b1c461dbc0dcc5c&host=db1016&user=&schema=puppet&hours=24 this seems to be occurin...
[10:16:25] (PS6) Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - https://gerrit.wikimedia.org/r/349416
[10:23:56] (PS11) Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:24:33] (CR) Volans: "Addressed comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/349416 (owner: Volans)
[10:24:51] (CR) jerkins-bot: [V: -1] [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[10:26:45] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530)
[10:27:18] !log upload kafkatee 0.1.5 to jessie-wikimedia, remove unused kafkatee 0.1.4 from trusty-wikimedia - T149451
[10:27:23] (PS2) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530)
[10:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] T149451: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451
[10:29:33] (PS12) Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:33:30] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:34:40] (PS1) Filippo Giunchedi: role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196
[10:35:36] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:37] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:47] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:47] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:48] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:48] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:49] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:36:36] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[10:37:46] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[10:37:46] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:38:51] (CR) Elukey: [C: +2] role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196 (owner: Filippo Giunchedi)
[10:39:07] (PS2) Elukey: role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196 (owner: Filippo Giunchedi)
[10:39:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:39:46] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:40:46] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:40:48] moritzm: --^ is that you upgrading ?
[10:41:36] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[10:41:46] elukey: I don't think so. cassandra on restbase1014 was restarted with c-foreach-restart, but I don't think that's related
[10:41:55] ^godog, any idea about the error message?
[10:42:28] mhh no I haven't seen that before
[10:42:30] checking one of the hosts
[10:42:31] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:43:32] seems Error: Operation timed out - received only 1 responses. from restbase
[10:43:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 - T164530 (duration: 00m 38s)
[10:43:36] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:42] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[10:44:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:46:05] (PS1) Marostegui: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530)
[10:46:36] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:47:09] godog: could it be that the instances on restbase1014 going down have caused some read to quorum to fail ?
[10:48:26] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[10:48:38] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[10:48:38] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[10:48:39] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[10:48:39] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[10:48:40] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[10:48:40] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[10:48:41] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[10:48:51] elukey: that might be yeah, in fact it just recovered, did restbase say what hosts was expecting a reply but didn't?
[10:49:37] I can see a lot of Error: Operation timed out - received only 1 responses
[10:50:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:53:10] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:53:19] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:54:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085, depool db1088 - T164530 (duration: 00m 38s)
[10:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:36] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[10:55:45] (03PS1) 10Marostegui: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530)
[10:56:40] <_joe_> elukey, godog that error means that service-checker could not obtain a response within 5 seconds for that request
[10:57:04] <_joe_> so it seems a backend timeout for restbase
[10:58:47] indeed
[10:58:59] yep
[11:00:18] but I was wondering if the cassandra restarts caused some reads to fail the quorum
[11:00:22] returning errors
[11:02:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:06:43] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:06:53] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:07:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088, depool db1093 - T164530 (duration: 00m 38s)
[11:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:09:09] (03PS1) 10Marostegui: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530)
[11:09:44] (03PS3) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:10:29] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:10:33] (03PS4) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:11:18] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:11:33] (03PS5) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:12:13] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:12:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:18:40] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:18:52] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:19:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T164530 (duration: 00m 38s)
[11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:41] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:34:19] (03PS6) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:35:12] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:42:19] (03CR) 10Filippo Giunchedi: [C: 031] Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 (owner: 10Volans)
[11:42:50] <_joe_> !log pushed calico/kube-policy-controller:0.6.0 to the docker registry
[11:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:46] (03PS7) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:44:01] <_joe_> !log pushed calico/node:1.2.0 to the docker registry
[11:44:07] (03PS8) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:57] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:48:48] (03PS9) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:49:40] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:51:25] <_joe_> !log uploaded calicoctl 1.2.0-1~wmf1 to jessie-wikimedia
[11:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:52] <_joe_> !log uploaded calico-cni 1.8.3-1~wmf1 to jessie-wikimedia
[11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:19] (03PS10) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:55:44] (03CR) 10Jcrespo: [C: 031] "root@db1015:~$ megacli -LDSetProp WT -L0 -a0" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:56:04] !log set vm.dirty_backround_bytes=25165824 on aqs1004 as part of testing for https://gerrit.wikimedia.org/r/#/c/354107 (Rollback: set vm.dirty_backround_ratio=10)
[11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:54] (03CR) 10Elukey: "Set vm.dirty_backround_bytes=25165824 on aqs1004 as test. Will come back to this code review in a day or two." [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto)
[11:57:11] (03CR) 10Marostegui: [C: 031] "Thanks for putting this together so quickly!" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:57:41] (03CR) 10Jcrespo: [C: 031] "I checked analytics hosts and swift hosts with dell, and all seem to be using writeback, and will benefit from this." [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:58:11] (03PS11) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[12:09:27] !log uploaded hhvm 3.18.2+dfsg-1+wmf4 to apt.wikimedia.org (contains extended upstream fix for XML reader crash) (T162586)
[12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:37] T162586: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586
[12:09:56] !log joal@tin Started deploy [analytics/refinery@222d0c0]: (no justification provided)
[12:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:52] !log joal@tin Finished deploy [analytics/refinery@222d0c0]: (no justification provided) (duration: 03m 56s)
[12:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:12] !log joal@tin Started deploy [analytics/refinery@679aeea]: Weekly deploy (with 2 weeks late, big deploy)h
[12:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:55] !log upgrading mw1261-mw1265 to hhvm 3.18.2+dfsg-1+wmf4
[12:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:46] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%)
[12:23:20] joal: ill fix that deploy log to remove the extra h for you :)
[12:23:38] Thanks a lot Zppix - Sorry for the mess
[12:24:01] joal: hey i do it all the time no worries
[12:24:08] :)
[12:24:36] !log joal@tin Finished deploy [analytics/refinery@679aeea]: Weekly deploy (with 2 weeks late, big deploy)h (duration: 04m 24s)
[12:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:01] Will fix ^
[12:25:06] Give me a min
[12:25:30] fixing stat1002 :)
[12:27:47] RECOVERY - Disk space on stat1002 is OK: DISK OK
[12:31:00] Is this week a norm deploy schedule... i know what wikitech says but i want human verification
[12:31:26] yes
[12:31:55] Alright thanks MatmaRex o/
[12:32:47] Zppix: you've something to deploy at the next EU SWAT?
[12:33:29] Dereckson: nope just verifying incase something comes up at all this week
[12:33:46] Dereckson: thanks for the concern however
[12:33:48] ok
[12:35:01] (03PS1) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208
[12:36:59] (03PS2) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121)
[12:38:13] !log joal@tin Started deploy [analytics/refinery@679aeea]: Weekly deploy (2 weeks late, big deploy)-2
[12:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:48] !log joal@tin Finished deploy [analytics/refinery@679aeea]: Weekly deploy (2 weeks late, big deploy)-2 (duration: 01m 35s)
[12:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:10] (03PS2) 10Dereckson: Apache: add techconduct.wm.o to remnant sites [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977)
[12:43:13] (03CR) 10Dereckson: Apache: add techconduct.wm.o to remnant sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson)
[12:45:55] !log elukey@tin Started deploy [analytics/refinery@679aeea]: (no justification provided)
[12:45:56] !log elukey@tin Finished deploy [analytics/refinery@679aeea]: (no justification provided) (duration: 00m 01s)
[12:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:08] jouncebot: refresh
[12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:24] I refreshed my knowledge about deployments.
[12:46:46] !log elukey@tin Started deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment
[12:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:28] !log elukey@tin Finished deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment (duration: 00m 42s)
[12:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:46] PROBLEM - Disk space on elastic1021 is CRITICAL: DISK CRITICAL - free space: /srv 61487 MB (12% inode=99%)
[12:49:33] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286077 (10ayounsi) >>! In T165614#3271670, @BBlack wrote: > 1. Probably the reason for a lack of neighbors is that some (most?) of the switches don't blanket-enable LLDP for all ports. They explicitly list ce...
[12:51:39] (03PS1) 10Mark Bergsma: Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213
[12:51:41] (03PS1) 10Mark Bergsma: Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214
[12:52:36] joal: http://wikitech.wikimedia.org/wiki/special:Diff/1760329
[12:53:17] Thanks Zppix :)
[12:53:44] joal: hey no problem, it gives me something to do :)
[12:54:09] I don't know why but I can't imagine you have nothing to do ;)
[12:54:11] Zppix: --^
[12:54:31] joal: i mean 90% of time i dont
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1300).
[13:00:05] Urbanecm and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:15] Okay I can SWAT.
[13:00:24] Urbanecm: ping?
[13:00:28] I'm here
[13:01:43] I am idling around if needed
[13:01:53] (03Abandoned) 10Dereckson: Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: 10Dereckson)
[13:02:25] (03PS1) 10BBlack: RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216
[13:02:27] (03PS1) 10BBlack: RPS cleanup 2/4: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217
[13:02:29] (03PS1) 10BBlack: RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218
[13:02:31] (03PS1) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[13:02:34] (03PS2) 10Dereckson: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:02:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:06:46] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 852.26 seconds
[13:06:56] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:06:56] PROBLEM - HP RAID on ms-be1038 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:07:16] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:07:29] yeah yeah, rebalance
[13:07:33] silencing
[13:07:46] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:10:27] hashar: we're blocked waiting an operations-mw-config-composer-hhvm-jessie if the 'idling around' was for expected CI congestion
[13:12:30] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 (owner: 10BBlack)
[13:12:46] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack)
[13:16:46] RECOVERY - Disk space on elastic1021 is OK: DISK OK
[13:18:56] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:19:09] (03Merged) 10jenkins-bot: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:19:18] (03CR) 10jenkins-bot: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:19:29] Here we are.
[13:20:00] live on mwdebug1002
[13:23:15] Okay, works fine.
[13:23:58] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add *.esa.int to CopyUploadsDomains (T164643) (duration: 00m 39s)
[13:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:05] T164643: Please add esamultimedia.esa.int to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T164643
[13:27:33] (03PS1) 10Filippo Giunchedi: hieradata: move webrequest 5xx to logstash.svc [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451)
[13:29:04] godog, talking about raid checks: https://gerrit.wikimedia.org/r/355190
[13:29:53] we had numerous cases where a policy change caused a performance hit
[13:30:16] you and analytics are the main other user of this kind of raid
[13:31:37] jynus: nice, so essentially we always want WB
[13:31:48] I would assume
[13:32:01] and it will ping us id a BBU is faulty
[13:32:04] *if
[13:32:14] and the policy is wt if bbu error
[13:32:32] later we can have better BBU monitoring, but that is hard
[13:32:51] this will avoid me (us) thinking "why mysql is slow?"
[13:33:11] not sure if you suffered "why swift is slow"?
[13:34:39] and I will look into doing the same for hp
[13:34:42] not in the same pattern no, there aren't many writes anyways, the write intensive dbs are on ssd
[13:34:50] ah, ok
[13:35:01] for us, it is the difference betwenn working and lagging
[13:35:14] and happened very often for older hosts
[13:36:12] tell me if you want to test it more, or I can try deploying it and monitoring the results
[13:37:12] !log Run CleanDuplicateScores script to clean up possible duplicates on fawiki before starting to create the UNIQUE keys - https://phabricator.wikimedia.org/T164530
[13:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:52] (03CR) 10Dereckson: [C: 032] Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[13:41:41] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, doesn't look like this is rebased on production yet though" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[13:41:57] jynus: ^
[13:42:03] jynus: not a big expert so I can't really give you any good review, but it seems good from what I can read. We principally use raid 0 with single disks on the analytics hosts, so it might affect us as well
[13:44:06] what do you mean rebased on production?
[13:44:35] elukey: you want wb, I would assume, too?
[13:45:10] (03PS1) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[13:45:22] basically, caching writes based on the promise that the memory with the battery will assure no writes are lost
[13:45:50] increasing performance of the underlying disks
[13:46:32] (03CR) 10jerkins-bot: [V: 04-1] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack)
[13:47:16] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286251 (10ayounsi) I went ahead and ignored those hosts for this specific alert (using T133852#3251556 ) Please reopen the task if needed t...
[13:47:26] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286252 (10ayounsi) 05Open>03Resolved
[13:47:59] it is a genouine question, I rebased from head, maybe I did something wrong?
[13:48:23] yep yep I think this is what we need, write through doesn't seem to be ideal, checking the current config
[13:49:03] I can revert this easily if there is any issue, and it doesn't page, so it is "ok"
[13:50:03] 06Operations, 10netops: JSNMP flood of errors across multiple switches - https://phabricator.wikimedia.org/T83898#3286267 (10ayounsi) 05Open>03Resolved a:03ayounsi That's not happening anymore.
[13:50:14] maybe it is the topic? I used mysql because it is in direct response for T166108 incident
[13:50:14] T166108: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108
[13:50:47] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi)
[13:50:52] Current Cache Policy: WriteThrough
[13:50:56] oh
[13:51:00] where do you have that?
[13:51:10] this is spot checking on one analytics host..
[13:51:14] I did not find any host with that
[13:51:19] elukey@analytics1033:~$ sudo megacli -LDPDInfo -aAll | grep Cache
[13:51:23] maybe the wrong command
[13:51:28] no, no, it is ok
[13:51:30] (I mean, from my side)
[13:52:00] but I checked other analytics hosts and had WB on them
[13:52:14] so either it is a mistake, and this patch will detect that
[13:52:30] or it is intended, and the patch may be not what it is intended
[13:52:55] So afaik our set up is a bit weird (at lesast for me), we have multiple Virtual Drives (12) with one disk in raid0 each..
[13:53:27] I always assumed it was a way to have a sort of JBOD
[13:53:46] (03PS2) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[13:53:48] (03PS2) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[13:54:05] but now I don't understand why we don't use WB, that should be a good thing even for this weird use case
[13:54:06] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: move webrequest 5xx to logstash.svc [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451)
[13:54:22] nice godog ---^ \o/
[13:54:30] python check-raid.py says WARNING: no known controller found
[13:55:08] (on analytics1040 I see WB set)
[13:55:15] but I see OK: optimal, 13 logical, 14 physical
[13:55:26] let me review the execution parameters
[13:55:42] you should not vote +1 in this case
[13:55:50] vote -1 and we should research more
[13:56:04] elukey: \o/ but yeah each disk is indeed in raid0 to do sort of JBOD
[13:56:16] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack)
[13:56:23] this is not a popularity contest, it is just coordination!
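The spot-checking being done by hand above amounts to grepping `megacli -LDPDInfo` output for the "Current Cache Policy" line of each logical drive and flagging anything that is not WriteBack. A minimal illustrative sketch of that idea (hypothetical code, not the actual check-raid.py):

```python
# Count logical drives whose current write policy is not WriteBack
# in the output of `megacli -LDPDInfo -aAll`. A flat or faulty BBU
# typically flips the controller to WriteThrough silently.

def non_writeback_lds(megacli_output: str) -> int:
    bad = 0
    for line in megacli_output.splitlines():
        if line.startswith("Current Cache Policy:"):
            policy = line.split(":", 1)[1].strip()
            if not policy.startswith("WriteBack"):
                bad += 1
    return bad

sample = """\
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""
# One LD has fallen back to WriteThrough, so a check built on this
# would report it as CRITICAL.
assert non_writeback_lds(sample) == 1
```

Only the first comma-separated item (the write policy) is compared, matching the point made later in the conversation that ReadAdaptive/ReadAhead and the Bad-BBU fallback setting are not write policies.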
[13:56:30] ahahhaha
[13:56:43] sure sure but I think that it is good even for analytics to have a WB check
[13:56:45] (03CR) 10jerkins-bot: [V: 04-1] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack)
[13:56:52] oh, I was wrong
[13:56:58] I lacked root permissions
[13:57:01] answer is
[13:57:27] P5477
[13:57:33] will check "sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy" | uniq -c" in the meantime across the analytics hosts
[13:57:40] oh, no expansion by bugbot
[13:57:45] https://phabricator.wikimedia.org/P5477
[13:58:07] the important thing here is to know what that *should* be
[13:58:23] and adjust the cleck accordingly
[13:58:26] *check
[13:59:21] (03CR) 10Ema: [C: 031] Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma)
[13:59:41] with analytics hosts, are they hadoop?
[13:59:42] (03CR) 10Ema: [C: 031] Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma)
[13:59:49] or what, mostly?
[14:00:35] yeah hadoop workers
[14:00:37] sorry to bother you with this, but this check is quite imporatant for us- to get right, not to rush
[14:00:47] nono sorry to slow you down, I am currently checking
[14:00:51] this is a good thing
[14:00:54] not at all
[14:00:58] you are helping
[14:01:02] not slowing down
[14:01:10] I thank you for that
[14:01:23] and if this detects a misconfig for you, you also win something :-)
[14:01:59] so WT seems only on an1033 so I was really lucky to find it :D
[14:02:01] if it is intended, we can add an option so that certain hosts
[14:02:17] (03PS2) 10BBlack: RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216
[14:02:17] do not check this, or check a different policy
[14:02:19] (03PS2) 10BBlack: RPS cleanup 2/4: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217
[14:02:21] (03PS2) 10BBlack: RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218
[14:02:23] (03PS3) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[14:02:25] (03PS3) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[14:03:29] The only inconsistency that I see now is ReadAdaptive/ReadAhead, Write Cache OK if Bad BBU / No Write Cache if Bad BBU
[14:03:46] do you have a host list?
[14:03:50] !log installing nutcracker update in codfw (T163795)
[14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:58] T163795: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795
[14:04:04] sure, let me grab the cumin command
[14:04:16] wait, those are not write policies
[14:04:21] I do not check for those
[14:04:25] only the first item
[14:04:45] super, so I'd say that only an1033 is misconfigured
[14:04:52] I'll fix it straight away
[14:05:03] now I am going to review the other non hadoop workers
[14:05:13] to see if I have "creative" configurations
[14:05:25] (03CR) 10Ema: [C: 031] Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark Bergsma)
[14:05:33] (03CR) 10Ema: [C: 031] Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma)
[14:07:19] alternatively, we can deploy this- if we get errors, I revert, if we get few errors, we can check only those
[14:07:40] I didn't check every single host, but I checked most of them
[14:08:37] (03CR) 10Elukey: [C: 031] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[14:08:44] there you go :)
[14:09:07] I will monitor this closely
[14:09:13] me too, thanks!
[14:09:16] if there is something weird, I will revert
[14:09:26] !log re-enabling BGP session to Init7 - T165288
[14:09:29] (03PS12) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[14:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:34] T165288: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288
[14:09:35] (03CR) 10Ema: [C: 031] RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 (owner: 10BBlack)
[14:10:03] I prefer having something, even imperfect
[14:12:20] (03PS3) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121)
[14:12:25] (03CR) 10Dereckson: [V: 032 C: 032] Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:12:52] (03CR) 10Dereckson: [C: 032] "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:14:13] 06Operations, 10netops: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3286326 (10ayounsi) 05Open>03Resolved From Init7: ``` Update: 2017.05.17 09:00 (CEST) First link to AMS-IX has been enabled. Update: 2017.05.18 09:40 (CEST) Second link to AMS-IX has been enab...
[14:14:21] (03CR) 10Jcrespo: [C: 032] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[14:15:36] (03CR) 10Hashar: [C: 031] ci: Docker registry for container builds [puppet] - 10https://gerrit.wikimedia.org/r/345422 (https://phabricator.wikimedia.org/T161657) (owner: 10Dduvall)
[14:15:51] (03CR) 10Hashar: [C: 031] [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: 10Dduvall)
[14:17:18] (03Merged) 10jenkins-bot: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:17:27] (03CR) 10jenkins-bot: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:21:30] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable NewUserMessage on dty.wikipedia (T166121) (duration: 00m 38s)
[14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:38] T166121: Enable Extension:NewUserMessage on Doteli Wikipedia - https://phabricator.wikimedia.org/T166121
[14:21:59] !log EU SWAT done
[14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:22] !log deploying new check_raid monitoring write policy for megacli T166108
[14:25:28] \o/
[14:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:29] T166108: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108
[14:27:37] already rolled on some hosts, looking good so far
[14:27:59] tin is failing
[14:28:36] failing with what?
[14:28:40] lol [14:28:50] CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:29:14] ah XD [14:29:20] not necessarily a check error [14:29:27] (03CR) 10Mark Bergsma: [C: 032] Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma) [14:31:23] 06Operations, 10ops-eqiad, 10netops: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3286368 (10faidon) OK, I see the prompt in the console: ``` CentOS release 6.9 (Final) Kernel 2.6.32-696.1.1.el6.x86_64 on an x86_64 us-qas-as14907.anchors.atlas.ripe.net login: ``` We don't have the... [14:31:25] it could make sense there [14:31:58] there, disk performance is not that important, and consistency is a must [14:32:41] dataset1001 has the same issue, although there it could be more doubtful [14:33:31] not reverting yet, as it will probably only hit 2-3 hosts [14:33:42] (for now) [14:33:54] yeah, let's wait a bit more [14:34:10] is apergos around? [14:34:49] (03Merged) 10jenkins-bot: Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma) [14:34:56] (03PS8) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [14:35:35] I think tin makes sense (there is no BBU present there - or I cannot see it) [14:35:47] no BBU? [14:36:01] Either no BBU or I cannot see it [14:36:03] then we should be able to change the policy [14:36:06] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2026842 [14:36:14] with no effect, wouldn't we?
[14:36:37] sorry, I was thinking of the cache [14:36:42] not the battery [14:36:47] forget what I just said [14:37:06] PROBLEM - MegaRAID on tin is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:37:07] ACKNOWLEDGEMENT - MegaRAID on tin is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166136 [14:37:10] 06Operations, 10ops-eqiad: Degraded RAID on tin - https://phabricator.wikimedia.org/T166136#3286375 (10ops-monitoring-bot) [14:37:19] lol [14:37:20] jynus: hehe :) [14:37:27] ahahah [14:37:37] these bots are taking over our jobs [14:37:39] automation to the rescue :-P [14:38:17] so, because we have 2 different scripts [14:38:26] we have to change the other, too [14:38:34] :-( [14:38:36] I always imagine the ops bot with volans face saying: This RAID isn't good, amigo! [14:38:57] PROBLEM - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:38:58] ACKNOWLEDGEMENT - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166137 [14:39:01] 06Operations, 10ops-eqiad: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T166137#3286381 (10ops-monitoring-bot) [14:39:15] can we temporarily kill that bot? [14:39:16] btw jynus one of the things I was trying to say before but failed to explain myself, was that we should check that there is physically a BBU (not faulty) [14:39:40] it's not a bot it's called by icinga as a handler [14:39:53] who is the active one, tegmen or einsteinium?
[14:40:33] it used to be tegmen, but double check [14:40:54] analytics1039 is next [14:41:13] db1046, but that is a genuine problem [14:41:31] helium [14:41:37] which is backups, so makes sense [14:41:37] and restbase hosts [14:41:41] let's revert [14:41:50] sounds good [14:41:51] and check conditionally [14:42:07] (03PS1) 10Jcrespo: Revert "raid-check: Return critical when not in WriteBack mode for megacli" [puppet] - 10https://gerrit.wikimedia.org/r/355228 [14:42:17] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "raid-check: Return critical when not in WriteBack mode for megacli" [puppet] - 10https://gerrit.wikimedia.org/r/355228 (owner: 10Jcrespo) [14:42:24] dataset1001 looks like it has a healthy BBU though, so maybe the configuration is simply wrong there :) [14:42:48] !log temporarily disabled raid_handler and puppet on tegmen [14:42:55] it is ok [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] so we create a parameter and use it only on db, es, pc, ms-be and analytics hosts? [14:44:32] or you detect if a BBU is physically present, not sure if it's tricky [14:44:46] no [14:44:51] even if BBU is there [14:45:07] there are some hosts where I don't see a problem with having the slowest option [14:45:14] think tin and helium [14:45:19] probably others [14:45:26] Yeah, I think the parameter is a better approach indeed.
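[Editor's note] The flood of `CRITICAL: N LD(s) not in WriteBack policy (WriteThrough)` alerts above comes down to scraping the write policy out of megacli's per-logical-drive output. A minimal sketch of that parsing step, in the Nagios exit-code convention the check uses (the `Current Cache Policy:` line format is an assumption based on typical `megacli -LDInfo` output; the production `check-raid.py` may differ):

```python
import re

def check_write_policy(ldinfo_output, wanted="WriteBack"):
    """Count logical drives whose current cache policy differs from
    the wanted write policy. Returns (exit_code, message) where
    0 = OK and 2 = CRITICAL, following the Nagios plugin convention.
    """
    # Assumed line format, one per logical drive:
    #   Current Cache Policy: WriteThrough, ReadAheadNone, ...
    policies = re.findall(r"Current Cache Policy:\s*(\w+)", ldinfo_output)
    bad = [p for p in policies if p != wanted]
    if bad:
        return 2, "CRITICAL: %d LD(s) not in %s policy (%s)" % (
            len(bad), wanted, ", ".join(bad))
    return 0, "OK: all %d LD(s) in %s policy" % (len(policies), wanted)
```

With one WriteThrough drive in the input, this reproduces the exact message icinga was spamming during the incident.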
like dataset1001 it has WT, and it might be a good idea there (I don't know) [14:45:46] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:45:46] PROBLEM - MegaRAID on analytics1039 is CRITICAL: CRITICAL: 13 LD(s) not in WriteBack policy (WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough) [14:45:47] PROBLEM - MegaRAID on prometheus2003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:45:56] PROBLEM - MegaRAID on analytics1033 is CRITICAL: CRITICAL: 13 LD(s) not in WriteBack policy (WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough) [14:46:13] prometheus is probably a mistake, assuming the right hardware? [14:46:35] (03CR) 10Mforns: [C: 031] "LGTM! :]" [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [14:46:56] PROBLEM - DPKG on thorium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:47:16] PROBLEM - MegaRAID on helium is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:47:22] marostegui: if you have free time, can you give a look at db1046? 
[14:47:27] yes [14:47:31] I will run puppet to recover [14:47:46] PROBLEM - MegaRAID on restbase-test2002 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:48:13] 06Operations, 10Analytics, 15User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3286412 (10elukey) [14:48:46] PROBLEM - MegaRAID on restbase-test2003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:49:14] (03PS9) 10Ottomata: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [14:49:29] jynus: I am here now [14:49:49] the thing here is to create some tickets to review config [14:49:54] PROBLEM - MegaRAID on rdb1003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:50:17] and differentiate between intended configurations and mistaken configurations [14:50:21] jynus: I am creating the task for db1046 [14:50:48] db1046 for sure is not intended [14:51:03] the other thing is if we can change it, due to BBU issues [14:51:31] (03CR) 10Hashar: "That is a good first pass!!"
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [14:52:14] ^ thorium is fine, jdk update in progress [14:52:41] or if I am a jerk and lazy, we enable it on databases, and "Every man for himself" [14:53:20] I would bet most of those, with few exceptions, are misconfigs [14:55:34] PROBLEM - MegaRAID on rdb1004 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:55:44] RECOVERY - MegaRAID on analytics1039 is OK: OK: optimal, 13 logical, 14 physical [14:56:04] RECOVERY - MegaRAID on analytics1033 is OK: OK: optimal, 13 logical, 14 physical, WB policy [14:56:05] PROBLEM - MegaRAID on prometheus1003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:56:23] (03PS1) 10Ema: bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 [14:56:42] !log otto@tin Started deploy [eventlogging/analytics@25f8096]: (no justification provided) [14:56:46] !log otto@tin Finished deploy [eventlogging/analytics@25f8096]: (no justification provided) (duration: 00m 04s) [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:59] jynus: in my case you discovered a big misconfig mess in analytics* hosts, so thanks :) [14:57:04] RECOVERY - MegaRAID on tin is OK: OK: optimal, 1 logical, 2 physical [14:57:15] (03CR) 10Ottomata: [C: 032] Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [14:57:54] RECOVERY - DPKG on thorium is OK: All packages OK [14:59:04] RECOVERY - MegaRAID on dataset1001 is OK: OK: optimal, 3 logical, 36 physical [14:59:09] elukey: what would be the best way to proceed, do I create a megaticket, one ticket per group? 
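[Editor's note] The alternative discussed above to a per-role parameter was to detect whether a healthy BBU is physically present before demanding WriteBack. A hedged sketch of that heuristic, assuming the field names of typical `MegaCli -AdpBbuCmd -GetBbuStatus` output (the real output varies by controller and firmware, which is part of why the parameter was preferred):

```python
def has_healthy_bbu(bbu_status_output):
    """Heuristic BBU presence/health check.

    Assumptions (not verified against production hardware):
    - a present BBU reports a "Battery State" field;
    - a faulty one reports "Battery Replacement required : Yes".
    """
    text = bbu_status_output.lower()
    if "battery state" not in text:
        # No battery reported at all (e.g. tin in this incident).
        return False
    return "battery replacement required : yes" not in text
```

Note this still would not solve hosts like helium or tin, where WriteThrough is intended even when a BBU exists, hence the parameter approach.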
[15:00:00] 06Operations, 10Analytics, 10DBA: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286450 (10Marostegui) [15:00:14] (03CR) 10Paladox: "Needs uploading to apt.wikimedia.org :)" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355155 (owner: 10Chad) [15:01:13] jynus: I already created a task, it might be better to collect them in a tracking one.. what do you think? [15:01:28] please share the number [15:01:29] T166140 [15:01:29] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140 [15:01:32] thanks [15:01:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:45] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:46] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:46] PROBLEM - mobileapps endpoints 
health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:47] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:58] that is not me [15:01:58] !log otto@tin Started deploy [eventlogging/analytics@UNKNOWN]: (no justification provided) [15:02:00] !log otto@tin Finished deploy [eventlogging/analytics@UNKNOWN]: (no justification provided) (duration: 00m 02s) [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:34] (03CR) 10Filippo Giunchedi: "LGTM, it could also include Bug: T165043" [puppet] - 10https://gerrit.wikimedia.org/r/355110 (owner: 10Muehlenhoff) [15:05:18] (03PS1) 10Jcrespo: Revert "Revert "raid-check: Return critical when not in WriteBack mode for megacli"" [puppet] - 10https://gerrit.wikimedia.org/r/355231 [15:05:34] RECOVERY - MegaRAID on rdb1004 is OK: OK: optimal, 2 logical, 4 physical [15:05:44] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical [15:05:45] RECOVERY - MegaRAID on prometheus2003 is OK: OK: optimal, 2 logical, 6 physical [15:07:14] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical [15:07:44] RECOVERY - MegaRAID on restbase-test2002 is OK: OK: optimal, 2 logical, 2 physical [15:08:44] RECOVERY - MegaRAID on restbase-test2003 is OK: OK: optimal, 2 logical, 2 physical [15:09:54] RECOVERY - MegaRAID on rdb1003 is OK: OK: optimal, 2 logical, 4 physical [15:13:19] 
06Operations, 10ops-eqiad, 15User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3286474 (10Joe) Racking request is just that these new machines go in different rows. They can even be in the racks of the other conf* systems as those old systems will be eventually decom... [15:16:04] RECOVERY - MegaRAID on prometheus1003 is OK: OK: optimal, 2 logical, 6 physical [15:17:26] (03PS3) 10Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [15:17:33] (03PS4) 10Krinkle: varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) [15:17:35] 06Operations, 10ops-eqiad: Degraded RAID on tin - https://phabricator.wikimedia.org/T166136#3286485 (10Volans) 05Open>03Invalid This was a raid check false positive [15:17:54] 06Operations, 10ops-eqiad: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T166137#3286488 (10Volans) 05Open>03Invalid This was a raid check false positive [15:18:04] gtk [15:21:58] volans: I think it can be enabled now [15:22:36] (03PS7) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [15:22:45] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286505 (10Papaul) a:05akosiaris>03Papaul [15:22:57] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286506 (10akosiaris) Racking proposal sounds fine! 
[15:23:33] !log re-enabled raid_handler and puppet on tegmen [15:23:37] jynus: done, thanks [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:56] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286507 (10Papaul) [15:24:07] this was not worthless, it found some interesting things [15:24:21] my question is if the script should be executed every time? [15:24:35] which script? [15:24:48] the one creating tasks [15:25:21] it acknowledges all errors, and that would be the opposite of what we want [15:25:42] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3286512 (10fgiunchedi) [15:25:48] (03PS1) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:26:10] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753028 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi All done! Kibana dashboard https://logstash.wikimedia.org/app/kibana#/dashb... [15:26:13] or I should change it to create a task with different wording [15:26:19] jynus: it skips some errors, like timeout and connection refused [15:26:28] (03PS2) 10Gehel: logstash - apifeature indices need to be cleaned up [puppet] - 10https://gerrit.wikimedia.org/r/353560 [15:26:28] so if you prefer we could just make it skip this one too [15:26:31] as you want [15:26:32] so we can make it skip that one [15:26:35] that would work for me [15:27:38] is 'not in WriteBack policy' specific enough to match?
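[Editor's note] The handler-side fix volans proposes here is just a substring filter in `raid_handler.py`: check outputs matching any entry in SKIP_STRINGS never get an auto-filed Phabricator task. A simplified sketch of the idea (the entries and function below are illustrative, not the production list):

```python
# Hypothetical skip list; the real one lives near the top of
# raid_handler.py and is matched against the icinga check output.
SKIP_STRINGS = (
    "timed out",
    "connection refused",
    "not in WriteBack policy",  # the string added after this incident
)

def should_create_task(check_output):
    """Only file a task for outputs that match none of the skip strings."""
    return not any(s in check_output for s in SKIP_STRINGS)
```

The match string has to be specific enough not to appear in genuine degraded-RAID output, which is exactly the question asked above.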
[15:27:50] wait [15:28:09] don't change things yet, I have to redo the other script [15:28:38] I will ping you or add it myself once this is done [15:28:41] (03CR) 10Gehel: [C: 032] logstash - apifeature indices need to be cleaned up [puppet] - 10https://gerrit.wikimedia.org/r/353560 (owner: 10Gehel) [15:28:42] I'm not changing anything, you can do it yourself, line 22 in raid_handler.py, in SKIP_STRINGS [15:28:51] thanks [15:28:57] has to be something specific that doesn't appear in the other errors :D [15:29:11] yes, it should be easy [15:29:25] PROBLEM - HHVM rendering on mw2104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:15] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 74095 bytes in 0.227 second response time [15:32:23] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3286535 (10elukey) [15:32:38] anybody checking mobile apps? [15:32:48] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286536 (10ayounsi) LLDP added to all the interfaces in asw-ulsfo/eqiad. Already configured in codfw and esams. Which solves the first part of the issue above (minus the devices where lldp crashes). About the... [15:34:05] (03CR) 10DCausse: [C: 04-1] elasticsearch - deploy elasticsearch-curator along with elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:34:38] mobrovac: around?
[15:35:03] /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: 'NoneType' object has no attribute '__getitem__'): [15:35:06] {} [15:36:24] (03PS1) 10Ottomata: Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced) [puppet] - 10https://gerrit.wikimedia.org/r/355238 (https://phabricator.wikimedia.org/T67508) [15:37:30] (03PS2) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:39:29] (03CR) 10DCausse: [C: 031] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:39:46] (03PS3) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:40:21] (03CR) 10Ottomata: [C: 032] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced) [puppet] - 10https://gerrit.wikimedia.org/r/355238 (https://phabricator.wikimedia.org/T67508) (owner: 10Ottomata) [15:40:49] (03CR) 10Filippo Giunchedi: [C: 031] elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:41:54] (03CR) 10Mark Bergsma: [C: 032] bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 (owner: 10Ema) [15:42:06] (03PS4) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:42:08] (03CR) 10Mark Bergsma: [C: 032] Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma) [15:42:45] (03CR) 10Mark Bergsma: [C: 032] Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark 
Bergsma) [15:43:02] (03Merged) 10jenkins-bot: Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma) [15:43:05] (03CR) 10Mark Bergsma: [C: 032] Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma) [15:43:16] (03Merged) 10jenkins-bot: Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark Bergsma) [15:43:34] (03CR) 10Gehel: [C: 032] elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:43:51] (03Merged) 10jenkins-bot: Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma) [15:43:53] (03Merged) 10jenkins-bot: bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 (owner: 10Ema) [15:50:05] !log Stop replication on dbstore1002 s7 thread for maintenance - T163190 [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [15:55:15] urandom: you there? [15:55:58] elukey: ya [15:57:47] urandom: hello! There are a lot of CRITICALs for mobile apps but I don't get what's wrong [15:58:18] I always confuse where to check for service-checker-swagger test [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1600). [16:00:06] elukey: where to check? 
[16:00:31] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3286585 (10fgiunchedi) [16:00:36] urandom: yeah where the tests are defined [16:01:27] elukey: oh, ttbmk, those are part of the swagger spec, and are tested by a script "locally" [16:01:41] yep but where are the defs ? [16:01:43] where "locally" here means the cluster the service is running on [16:01:59] I mean, it runs some tests but where are those defined ? [16:02:10] in any case, something seems wrong with mobile apps [16:02:12] in the swagger spec for that endpoint, let me dig [16:02:21] (03PS5) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [16:02:27] elukey: yeah, i know very little about this but have pinged Pchelolo [16:03:19] super thanks [16:03:20] * Pchelolo is looking into it [16:03:24] o/ [16:03:35] no puppet swat patches btw [16:03:37] https://i.imgur.com/h11ujSy.gifv [16:03:47] I can see some 500s with query error on /srv/log/mobileapps/main.log [16:03:51] on scb1004 for example [16:04:22] godog: I've been working on that vhost stuff for scap ^^ [16:04:32] (it's probably not swat-ready, but could use another bit of review) [16:04:48] (03CR) 10jerkins-bot: [V: 04-1] Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [16:05:39] gdi jenkins [16:05:58] elukey: https://en.wikipedia.org/w/index.php?title=Cat&oldid=781841025 [16:06:19] Derp. [16:06:52] Pchelolo: --verbose :D [16:07:09] heh [16:07:16] should fix itself in a bit, the vandalism was reverted. The endpoint uses the page images page prop that's updated asynchronously via the jobqueue [16:08:00] elukey: heh sorry. The problem's that that endpoint should fetch the images associated with the page and obviously after that vandalism there's no images [16:08:13] the checker script though could've done a better job reporting the issue [16:08:30] RainbowSprinkles: ok!
thanks I'll take a look likely tomorrow [16:08:34] Pchelolo: what are the steps to determining this? [16:08:37] Pchelolo: what is the workflow to follow in these cases? I mean, where did you check? Just to learn a bit [16:08:42] (03PS6) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [16:08:58] godog: Ok thx [16:09:31] Pchelolo: what elukey said [16:09:33] :) [16:10:20] urandom: elukey: step 1 - ssh to scb node and run `check-mobileapps`, step 2 - verify that the check is indeed failing. step 3 - go to mobileapps source code and see what the check is doing step 4 - go look at what happened with the wiki page the check is using recently [16:10:31] i assume you can mine the mobileapps source for the swagger spec to determine what the test is, but is there an easier way? [16:10:44] Oh. [16:11:08] so that would be a No. :) [16:11:23] urandom: no easier way, but it's a good idea to make the checker script log the x-amples spec for a failed check [16:11:40] Worth creating a ticket [16:12:08] I remember we ran into this before, I think it was with Obama [16:12:17] see if I can find the task [16:12:38] Pchelolo: yeah I followed up to 3.
but then it was a bit difficult [16:13:01] especially since I don't know what pages it was testing [16:13:08] if godog doesn't find the task I'll create one [16:13:17] (03PS3) 10BBlack: RPS Cleanup 1/5: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 [16:13:20] (03PS3) 10BBlack: RPS cleanup 3/5: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217 [16:13:21] (03PS3) 10BBlack: RPS cleanup 4/5: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 [16:13:23] (03PS4) 10BBlack: RPS cleanup 5/5: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 [16:13:24] https://phabricator.wikimedia.org/T150560 [16:13:25] (03PS4) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 [16:13:28] (03PS1) 10BBlack: RPS cleanup 2/5: remove irqbalance module [puppet] - 10https://gerrit.wikimedia.org/r/355243 [16:13:28] Pchelolo: ^ [16:13:34] And it takes quite some time for the JobQueue to update the page properties now.. [16:13:44] (03PS1) 10Framawiki: Create a new namespace "Vikiproje" for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) [16:13:59] (03CR) 10BBlack: [V: 032 C: 032] RPS Cleanup 1/5: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 (owner: 10BBlack) [16:14:11] cool, thank you godog. I'll write a comment there indicating we've run into this again and raise the priority [16:14:31] np Pchelolo !
[16:14:41] (03CR) 10BBlack: [V: 032 C: 032] RPS cleanup 2/5: remove irqbalance module [puppet] - 10https://gerrit.wikimedia.org/r/355243 (owner: 10BBlack) [16:14:44] ah no I was misremembering, it was the Dog page, (retrieve page preview of Dog page) [16:14:48] elukey: https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/spec.yaml#L534 [16:15:11] (03CR) 10BBlack: [C: 032] RPS cleanup 3/5: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217 (owner: 10BBlack) [16:15:25] I think in this case just syslog'ing more info would be enough [16:15:26] !log cp1074: enable prometheus node_exporter qdisc collector T147569 [16:15:29] godog: https://giphy.com/gifs/obama-mic-drop-out-3o7qDSOvfaCO9b3MlO [16:15:31] (03CR) 10BBlack: [C: 032] RPS cleanup 4/5: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 (owner: 10BBlack) [16:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:34] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [16:15:45] thanks urandom [16:15:49] elukey: haha yeah that's me in 30 min or so [16:15:49] (03CR) 10BBlack: [C: 032] RPS cleanup 5/5: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack) [16:15:55] (03CR) 10BBlack: [C: 032] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack) [16:16:03] elukey: i realize that's not very helpful... 
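[Editor's note] The `check-mobileapps` flow debugged above is driven by `x-amples` entries in the service's swagger spec (the `spec.yaml` link given earlier): each entry names a request and the response shape it must match. A rough sketch of the idea with the fetcher injected so it stays network-free; the dict layout follows the x-amples convention but this is illustrative, not the real service-checker tool:

```python
def run_xample(xample, fetch):
    """Run one x-amples check.

    `xample` is a dict like the entries in spec.yaml:
      {"request": {"params": {...}},
       "response": {"status": 200, "body": {...}}}
    `fetch` performs the request and returns (status, body);
    the real checker would issue an HTTP call here.
    """
    status, body = fetch(xample["request"])
    expected = xample.get("response", {})
    if status != expected.get("status", 200):
        return "CRITICAL: expected status %s, got %s" % (
            expected.get("status", 200), status)
    # Require every key named in the expected body to be present,
    # which is roughly where the "malformed body" errors come from.
    for key in expected.get("body", {}):
        if key not in body:
            return "CRITICAL: malformed body, missing %r" % key
    return "OK"
```

Logging the failing x-ample itself (page title included) is precisely the improvement requested in T150560: the "Cat" reference would then be obvious from the alert text.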
[16:16:21] ahhhh now I got the error message from the check [16:16:40] !log disabled puppet on all lvs* for RPS-related deployments [16:16:41] sorry Pchelolo I am dumb, EOD hit me [16:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [16:16:56] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [16:16:56] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:16:57] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:16:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:16:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:16:58] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:16:58] a bit more info would have helped but I didn't read the error message correctly [16:16:59] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:17:05] (the Cat reference) [16:17:13] !log disabled puppet on all cp* for RPS-related deployments (just in case!) [16:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] here we go ^^^ [16:17:45] actually I didn't notice the Cat in the error description as well.. [16:17:51] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3286648 (10fgiunchedi) a:03Papaul @papaul please replace, thanks!
Also note that this drive has very few power on hours, almost DOA ``` => pd 1I:1:8 show array f show Smart Array P840 in Slot 3 array F... [16:18:10] (03Abandoned) 10Jcrespo: Revert "Revert "raid-check: Return critical when not in WriteBack mode for megacli"" [puppet] - 10https://gerrit.wikimedia.org/r/355231 (owner: 10Jcrespo) [16:22:38] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286656 (10Pchelolo) Same thing happened today for one of the mobile-apps checks: ``` /{domain}/v1/page/media/{title} (... [16:23:53] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286659 (10elukey) Today we received some CRITICALs from all the scb hosts and the error message was the following: ```... [16:23:58] Pchelolo: --^ [16:24:10] argh at the same time [16:24:11] sorry [16:24:27] removed [16:24:30] !log demon@tin Synchronized README: testing (duration: 00m 38s) [16:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:39] (03PS1) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) [16:25:04] elukey: hehe :) I've also added the idea that we might log a `curl` command that could be used to repeat the request - would be really easy to follow up in no time [16:25:17] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286676 (10fgiunchedi) Since icinga space for reporting the check output is limited anyway I think in this case addition... 
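The idea floated at 16:25:04 — have the checker emit a copy-pasteable `curl` command for the request that failed — can be sketched as a tiny shell helper. This is an illustration only (`build_repro_curl` is a made-up name, not part of the actual service-checker-swagger code):

```shell
# Hypothetical helper: turn a failed check's method, URL and headers into a
# curl command an operator can paste to reproduce the request.
build_repro_curl() {
  local method="$1" url="$2"
  shift 2
  local cmd="curl -sS -X ${method}"
  local h
  for h in "$@"; do          # remaining arguments are "Header: value" pairs
    cmd="${cmd} -H '${h}'"
  done
  printf '%s %s\n' "$cmd" "$url"
}
```

For example, `build_repro_curl GET https://example.org/api 'Accept: application/json'` prints a one-line curl invocation that could be appended to the syslog'd error.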
[16:25:25] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:25:44] (03PS2) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:25:52] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10Eevans) >>! In T150560#3286656, @Pchelolo wrote: > > [ ... ] > > Even more neat feature would be to convert t...
[16:26:26] Pchelolo: +1, thanks!
[16:26:34] !log puppet re-enables on caches
[16:26:35] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:43] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (10jcrespo) The previous patch was reverted, I am creating a separate one to allow to enable or disable the extra check at will (for megacli first).
[16:28:55] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[16:29:27] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3286698 (10jcrespo) ``` root@prometheus1003:~$ python check-raid.py OK: optimal, 2 logical, 6 physical OK root@prometheus1003:~$ python check-raid.py --policy=WriteBack CRITICAL:...
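The optional `--policy=WriteBack` check jcrespo demonstrates above boils down to scanning the controller's cache-policy lines and going CRITICAL when a logical drive has fallen out of WriteBack (which happens when the BBU fails). A minimal sketch of that logic — illustrative only, the real `check-raid.py` parses megacli output directly:

```shell
# Hypothetical write-policy check: reads "Current Cache Policy: ..." lines
# (one per logical drive, megacli-style) on stdin; CRITICAL if any drive is
# not running WriteBack.
check_write_policy() {
  if grep -q 'WriteThrough' -; then
    echo "CRITICAL: cache policy is not WriteBack"
    return 2    # nagios/icinga CRITICAL exit code
  fi
  echo "OK: all logical drives in WriteBack"
}
```

With a healthy BBU the controller reports `Current Cache Policy: WriteBack, ...` and the check stays OK; when the controller degrades to WriteThrough the check alerts, matching the behaviour shown in the phab paste.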
[16:32:31] (03CR) 10Framawiki: "Hello Hoo man, please add this patch in the list of deployments at wikitech . Thanks !" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: 10Hoo man)
[16:32:50] * elukey off
[16:35:24] (03PS3) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:36:18] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:41:10] (03PS4) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971)
[16:43:11] !log BBR: enabling mq+fq on cp1074 - T147569
[16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:18] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[16:43:47] (03PS5) 10Chad: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371
[16:45:25] (03PS2) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:45:51] bblack: 341729 is for you, btw :)
[16:46:56] (03PS3) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:47:23] It's the last of the stuff in files/* :D
[16:47:37] (03PS4) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:47:51] RainbowSprinkles: you might want to check this out from quick git grep:
[16:47:54] utils/create_ecdsa_cert: "${_dir}/../files/ssl/${name}.crt"
[16:47:56] utils/create_ecdsa_cert:git add "${_dir}/../files/ssl/${name}.crt"
[16:48:01] looks probably-relevant
[16:48:05] Ahhh
[16:48:08] Missed those
[16:48:13] utils/* confuses me :p
[16:49:12] !log BBR: enabling bbr on cp1074 - T147569
[16:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:20] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[16:50:10] (03PS4) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:50:27] !log upgrading facter on mw[2250-2259] as a test batch
[16:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:31] (03CR) 10Jcrespo: [C: 031] "I would like to deploy this, as it would be backwards compatible with the current behaviour, and then enable the new check selectively. Ye" [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:54:24] (03CR) 10Jcrespo: [C: 031] "Example usage:" [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:55:59] (03PS5) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:59:25] 06Operations, 06Discovery-Search (Current work): replace es-tool with elasticsearch-curator for standard elasticsearch operations - https://phabricator.wikimedia.org/T166154#3286827 (10Gehel)
[17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1700). Please do the needful.
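At the sysctl level, the "enabling mq+fq" and "enabling bbr" steps being !logged for cp1074 amount to the following on a 4.9+ kernel. This is a sketch of the mechanism, not the actual puppet change (the real rollout also manages per-subqueue qdisc setup via interface-rps):

```shell
# Sketch: enable the fq qdisc (BBR's recommended pacing qdisc) and switch
# TCP congestion control to BBR. Requires kernel 4.9+ and root.
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify; should now report bbr:
sysctl net.ipv4.tcp_congestion_control
```

Note the default_qdisc only applies to interfaces/queues (re)created afterwards, which is one reason the deploy here pairs the sysctl with explicit qdisc setup on the multiqueue NICs.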
[17:00:38] Deployment of services happen too soon in respect to my old timezone
[17:00:49] anyway, I want to try deploying again
[17:02:45] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:03:00] !log ladsgroup@tin Started deploy [ores/deploy@4874809]: Trying again with deploying ores
[17:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:02] (03PS1) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249
[17:05:42] 06Operations, 10ops-codfw, 10netops: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3286863 (10Papaul)
[17:09:04] !log starting branch cut for 1.30.0-wmf.2 T163512
[17:09:05] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3286899 (10Ottomata) ping @akosiaris @Joe
[17:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:13] T163512: MW-1.30.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T163512
[17:09:22] ^ jdlrobson anomie FYI
[17:24:17] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3286942 (10Ottomata) Also, if configuration of profiles can only be done via hiera, doesn't that mean any module parameter that we may want to override...
[17:24:29] !log ladsgroup@tin Finished deploy [ores/deploy@4874809]: Trying again with deploying ores (duration: 21m 30s)
[17:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:30] it all looks okay
[17:25:48] tell me if any alarm started to fire off
[17:26:58] Amir1: well, actually, we've an hypthesis for this alarm: a IFTTT component
[17:27:35] Amir1: nevermind, I mixed two conversations
[17:27:44] Dereckson: oh, okay
[17:27:45] :)
[17:27:53] Nice meeting you by the way
[17:28:05] Yes:)
[17:28:42] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286944 (10Volans) a:05Volans>03None
[17:30:55] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.57 seconds
[17:45:06] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3286990 (10jcrespo) I've checked, and the currently in use check does too much, probably we do not need such a thorough check every time icinga runs, wh...
[17:45:57] Hi ops!
[17:46:13] I'm asking about this in releng too, but it might be something you can help with
[17:46:31] CI is failing trying to connect to packagist.org
[17:47:24] The
[17:47:26] "https://packagist.org/p/provider-2017-04%2486c7cf8d14faebc894d0a52237f6873b18b913101a4d6b8f50fbac618900cd15.json" file could not be downloaded: Failed to enable crypto
[17:47:45] what is that, curl?
[17:47:58] or alternately Failed to decode response: zlib_decode(): data error
[17:48:21] jynus: it's from composer install, probably using curl internally
[17:48:43] "Ok I have found a solution. The problem is that the site uses SSLv3"
[17:48:46] maybe an IP address update / firewall issue?
[17:48:53] seems an error by openssl
[17:48:59] ooh, and nothing better?
[17:49:05] that's bizarre
[17:49:21] curl should negotiate up if both sides support it, right?
[17:50:21] Oh, following up from other channel: we haven't seen this on other jobs. Is it possible it was transient?
[17:50:22] well, over in -releng, RainbowSprinkles says it's probably something that team should be able to solve
[17:50:53] RainbowSprinkles: could be, but it's happened at least 4 times in the past half hour
[17:51:16] It is?
[17:51:17] https://integration.wikimedia.org/ci/job/wikimedia-fundraising-crm-composer-php55-trusty/
[17:51:17] it is an external site-anything could be
[17:51:20] that message or "Failed to decode response: zlib_decode(): data error", followed by half an hour of nothing
[17:51:21] I only see the one failure
[17:51:29] https://integration.wikimedia.org/ci/job/mwgate-composer-php55-trusty/3110/console
[17:51:50] the zlib_decode() is because it's fetching bogus data. The real error is the ssl/tls issue with packagist.
[17:51:55] https://www.ssllabs.com/ssltest/analyze.html?d=packagist.org&s=144.217.203.53&latest seems ok
[17:52:11] Hmm
[17:52:17] huh, some succeed though
[17:52:17] TLS1.2, modern compression, etc.
[17:52:23] *cipher
[17:52:35] but remember I am not your traffic guy
[17:52:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:59] Yeah, this is weird.
[17:53:29] sites can fail
[17:53:49] (03PS1) 10Thcipriani: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255
[17:55:20] jynus: Yeah that's why I suggested maybe transient
[17:59:39] maybe that can be cached for increased reliability?
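For the "curl should negotiate up" question above: yes, a TLS client offers its highest supported protocol version and the server picks, so intermittent "Failed to enable crypto" usually points at a flaky endpoint behind the hostname rather than a version mismatch. A hedged first-pass check from the failing CI host might look like this (plain diagnostics, not something that was actually run here):

```shell
# See what protocol/cipher actually negotiates against packagist from this host
openssl s_client -connect packagist.org:443 -servername packagist.org </dev/null 2>/dev/null \
  | grep -E '^ *(Protocol|Cipher) *:'

# And whether a forced-TLS1.2 fetch of the failing provider file succeeds
curl -svo /dev/null --tlsv1.2 https://packagist.org/packages.json 2>&1 \
  | grep -iE 'SSL connection|error'
```

Running these a few times would also distinguish a per-backend problem (packagist.org resolves to multiple IPs) from a local openssl/composer one.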
[18:02:09] https://packagist.org could not be fully loaded, package information was loaded from the local cache and may be out of date
[18:02:13] Hehe, it does ;-)
[18:03:13] RainbowSprinkles: but after that message, it just stalls for half an hour
[18:03:26] so something's not quite working with the caching
[18:04:33] ejegg RainbowSprinkles, i think that is being caused by https://github.com/composer/packagist/commit/e77ad7072b7c545d447c5c9d269a3682f90fb0b7
[18:06:05] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 291790
[18:07:49] (03PS1) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:08:17] paladox: hmm, I don't see how that would break like that
[18:08:36] It's caching the api requests for longer.
[18:08:52] Though when did the problem start? In the last 3 hours?
[18:08:56] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257 (owner: 10Krinkle)
[18:10:11] paladox yeah, it started recently. But the issue isn't stale packages, it's connection failures followed by half-hour waits when it tries to fall back to locally cached packages
[18:11:14] ok
[18:11:50] (03PS2) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:13:39] ejegg an old report, https://github.com/composer/composer/issues/4212
[18:13:47] but maybe the workaround will work for you?
[18:14:36] Oh, that is composers site not packagist.
[18:15:27] hmm, well, some builds are working
[18:15:40] I guess I'll just keep recheck-ing till mine goes through
[18:17:03] thcipriani: hi! Mmm looks like you already cut the wmf.2 branch? Mmm somehow the CentralNotice update to the deploy branch didn't get automatically pushed to core as it used to......
[18:17:23] (03PS1) 10Ottomata: Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166162)
[18:17:34] (03CR) 10Jcrespo: [C: 032] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[18:18:18] (03PS2) 10Ottomata: Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166164)
[18:20:53] AndyRussG: I'm not sure what you mean, afaict CentralNotice is pinned to the wmf_deploy branch: https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json#L178
[18:22:45] AndyRussG: what do you mean when you say "automatically pushed to core"?
[18:22:53] https://github.com/wikimedia/mediawiki/blob/wmf/1.30.0-wmf.2/.gitmodules#L1-L4 seems right based on config.json
[18:23:01] thcipriani: I was wrong, now it looks OK.... (????)
[18:23:08] Dunno what I'm doing wrong on my local repo
[18:23:36] AndyRussG: kk, let me know if you find anything further amiss with that extension.
[18:29:41] thcipriani: thx!!! yeah just double-checked again, all good. yes that config.son is also great.. :)
[18:29:53] cool :)
[18:30:05] * AndyRussG browses bash history to try to identify origin of silly git-submodule-confusion
[18:38:52] (03PS2) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[18:39:11] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari)
[18:39:38] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287159 (10kaldari)
[18:40:10] (03PS3) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:40:12] (03PS8) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:40:40] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari)
[18:40:57] (03PS4) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:41:05] (03PS9) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:41:21] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari) @Tnegrin: Could you approve this access request?
[18:42:30] (03PS10) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:47:05] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1900).
[19:00:10] * thcipriani does
[19:05:46] (03PS3) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:06:22] !log thcipriani@tin Started scap: testwiki to php-1.30.0-wmf.2 and rebuild l10n cache
[19:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:15] (03PS1) 10Papaul: DNs: Add mgmt and production DNS entries for ores200[1-9] [dns] - 10https://gerrit.wikimedia.org/r/355270
[19:14:04] (03CR) 10Dzahn: [C: 032] DNs: Add mgmt and production DNS entries for ores200[1-9] [dns] - 10https://gerrit.wikimedia.org/r/355270 (owner: 10Papaul)
[19:15:25] (03PS4) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:16:05] (03CR) 10Dzahn: "[bast1001:~] $ for orescodfw in $(seq 1 9); do host ores200${orescodfw}.codfw.wmnet; done" [dns] - 10https://gerrit.wikimedia.org/r/355270 (owner: 10Papaul)
[19:16:05] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[19:16:59] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10Dzahn) ``` [bast1001:~] $ for orescodfw in $(seq 1 9); do host ores200${orescodfw}.codfw.wmnet; done ores2001.codfw.wmnet has address 10.192.0.12 ores2002.codfw.wmnet has add...
[19:21:32] (03PS5) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:25:35] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:26:07] (03CR) 10Jcrespo: "This now does what it is supposed to do, but the style is not very good, can you give me some constructive criticism about that?" [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[19:27:35] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74094 bytes in 0.415 second response time
[19:28:15] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:05] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.203 second response time
[19:30:21] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287255 (10Tnegrin) approved
[19:31:04] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3287257 (10Jgreen) @BBlack ok I upgraded nginx and *ssl, and civicrm and the other frack-hosted sites should be fixed to include the HSTS header...
[19:31:41] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287260 (10kaldari)
[19:34:14] !log thcipriani@tin Finished scap: testwiki to php-1.30.0-wmf.2 and rebuild l10n cache (duration: 27m 52s)
[19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:36] (03CR) 10Jcrespo: "I forgot to paste the compilation results: https://puppet-compiler.wmflabs.org/6506/" [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[19:42:31] (03CR) 10Thcipriani: [C: 032] Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:45:25] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:48:35] (03Merged) 10jenkins-bot: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:48:48] (03CR) 10jenkins-bot: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:52:06] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.30.0-wmf.2
[19:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:56] (03PS1) 10BBlack: caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569)
[19:57:37] (03CR) 10jerkins-bot: [V: 04-1] caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[20:03:24] (03PS2) 10BBlack: caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569)
[20:06:22] !log disabling puppet on all caches for BBR deploy control
[20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:39] (03CR) 10BBlack: [C: 032] caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[20:10:37] !log enable BBR for all caches @ ulsfo - T147569
[20:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:44] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[20:14:25] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[20:20:35] !log enable BBR for all caches @ codfw - T147569
[20:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:44] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[20:21:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:21:55] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:23:25] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[20:23:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[20:23:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[20:23:35] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[20:23:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[20:23:36] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[20:23:45] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[20:24:32] (03PS1) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:24:55] !log enable BBR for all caches - T147569
[20:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:42] (03CR) 10Chad: [C: 031] "This is fine, minus a nit" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:28:03] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:28:33] (03PS2) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:29:46] (03CR) 10Daniel Kinzler: [C: 031] Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:31:20] (03CR) 10Daniel Kinzler: [C: 031] Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:32:41] (03CR) 10Krinkle: [C: 031] Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:33:28] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:33:42] (03CR) 10Krinkle: [C: 04-1] Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:34:38] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3287404 (10RobH)
[20:34:41] 06Operations, 10ops-codfw, 10netops: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3287401 (10RobH) 05Open>03Resolved a:03RobH Done!
[20:34:45] (03PS3) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:34:59] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:35:59] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10RobH)
[20:36:13] (03PS7) 10Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416
[20:38:35] PROBLEM - HP RAID on ms-be1036 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[20:42:56] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3287418 (10RobH)
[20:43:15] PROBLEM - Disk space on ms-be1008 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error
[20:53:02] (03CR) 10Bearloga: "@gehel Thanks for uploading shiny-server to the apt repo! I don't think there's anything else this patch depends on. Is there a way to tes" [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga)
[20:54:55] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1]
[20:55:32] 06Operations, 10Monitoring, 06Multimedia: Create grafana dashboard for video scaler job runners - https://phabricator.wikimedia.org/T163033#3184054 (10Krinkle)
[20:56:15] PROBLEM - MegaRAID on ms-be1008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[20:56:26] ACKNOWLEDGEMENT - MegaRAID on ms-be1008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166177
[20:56:29] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3287545 (10ops-monitoring-bot)
[20:57:13] (03CR) 10Volans: [C: 032] Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 (owner: 10Volans)
[20:59:41] RECOVERY - HP RAID on ms-be1036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[21:00:20] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3287584 (10Volans)
[21:01:16] RECOVERY - Disk space on ms-be1008 is OK: DISK OK
[21:02:33] (03PS11) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[21:04:33] 06Operations, 10Monitoring: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3287592 (10RobH)
[21:04:45] 06Operations, 10ops-codfw, 10Monitoring: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3287609 (10RobH)
[21:07:21] (03PS1) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:09:48] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install resetbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10RobH)
[21:10:00] (03CR) 10jerkins-bot: [V: 04-1] varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:11:07] (03PS2) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:11:34] (03PS3) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:11:39] (03CR) 10Krinkle: "(whitespace)" [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:16:36] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:26:47] (03PS1) 10BBlack: r::c::perf - raise fq flow_limit to 300 [puppet] - 10https://gerrit.wikimedia.org/r/355356 (https://phabricator.wikimedia.org/T147569)
[21:27:12] (03CR) 10BBlack: [V: 032 C: 032] r::c::perf - raise fq flow_limit to 300 [puppet] - 10https://gerrit.wikimedia.org/r/355356 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[21:36:37] (03PS2) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097)
[21:37:23] (03CR) 10jerkins-bot: [V: 04-1] Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott)
[21:40:49] (03PS3) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097)
[21:44:36] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[21:47:46] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:49:46] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[21:50:09] ACKNOWLEDGEMENT - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1] Volans failed disk, https://phabricator.wikimedia.org/T166177
[22:04:16] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:12:17] (03PS1) 10Volans: Puppet: run-puppet-agent improvements [puppet] - 10https://gerrit.wikimedia.org/r/355363
[22:13:06] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:13:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:13:56] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[22:14:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:15:27] (03CR) 10Volans: [C: 032] "@godog: I'm merging this to fix the bug, feel free to comment it also later and I'll address the comments tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/355363 (owner: 10Volans)
[22:29:40] (03PS1) 10Chad: scap clean: Some docs, minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355365
[22:33:18] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[22:41:06] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:16] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:46] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:46] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:47] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:47] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:43:46] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[22:43:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[22:43:56] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[22:44:06] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are
healthy [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T2300). [23:25:41] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#3287937 (10Dzahn) new docs on how to use the Google group: https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Maintain_the_.27maint-announce.27_mails_and_calendar [23:30:32] (03PS2) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:34:21] (03PS3) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:51:51] (03PS4) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:52:51] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 (owner: 10Dzahn) [23:58:15] (03PS5) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599