[00:02:55] (03CR) 10Chad: [C: 04-1] "Oh, hmm. For deploying with scap itself. Hmmm. Well we'd do it like we already do with id_rsa/id_rsa.pub in jetty.pp. We should be able to" [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:10:04] (03Draft1) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:10:06] (03PS2) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:10:50] (03PS6) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:14:01] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3414337 (10Dzahn) [00:18:57] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [00:21:30] (03CR) 10Dzahn: "the production private key has "content => secret('gerrit/id_rsa')," so it comes from private repo. imho the answer is to make a new key a" [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:22:42] (03CR) 10Dzahn: "the ".pub" part goes in the public repo and the private part in labs/private" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [00:24:43] (03CR) 10Dzahn: [C: 04-1] "i would say:" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [00:27:27] (03PS3) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:30:22] (03PS7) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:34:23] (03CR) 10Chad: WIP: Gerrit: Add support for scap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:35:24] (03CR) 10Paladox: WIP: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:35:37] (03PS8) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:35:53] sorry for a lot of spam. [00:40:16] (03PS9) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:41:11] (03PS5) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [00:41:12] adds more [00:43:19] (03CR) 10Dzahn: "PS5: dropped "wc" entirely, we can use "exipick -bpc" to count. | added "set -euo pipefail" as response to godog's comment." [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [00:44:26] (03CR) 10Dzahn: "though.. now that " | wc " is gone there is no pipe, so no real reason for "pipefail" either..
shrug" [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [01:02:07] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389324 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9195778 keys, up 2 minutes 2 seconds - replication_delay is 1499389324 [01:02:17] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480 [01:02:17] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389331 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9097506 keys, up 2 minutes 9 seconds - replication_delay is 1499389331 [01:02:17] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:02:17] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1499389333 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9192557 keys, up 2 minutes 12 seconds - replication_delay is 1499389333 [01:02:27] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389344 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9193379 keys, up 2 minutes 22 seconds - replication_delay is 1499389344 [01:03:07] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9188738 keys, up 3 minutes 2 seconds - replication_delay is 0 [01:03:17] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4486963 keys, up 3 minutes 7 seconds - replication_delay is 1 [01:03:17] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4483846 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:17] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9190551 keys, up 3 minutes 11 seconds - replication_delay is 0 [01:03:18] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9094321 keys, up 3 minutes 12 seconds - replication_delay is 0 [01:03:27] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9186521 keys, up 3 minutes 21 seconds - replication_delay is 0 [01:17:13] (03PS2) 10Dzahn: grafana: Add legend to dashboard varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [01:19:25] (03CR) 10Krinkle: [C: 031] Stop forcing php5 in `mwscript` [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [01:19:57] (03CR) 10Dzahn: [C: 032] grafana: Add legend to dashboard varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [01:20:33] (03CR) 10Krinkle: [C: 031] "Jobs don't use mwscript afaik. 
We use the standalone mediawiki/services/jobrunner service (PHP-based), which curls to localhost/rpc/RunJob" [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [01:48:19] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:17] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 80005 bytes in 0.403 second response time [02:59:27] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:05:58] (03PS6) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [03:10:15] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:10:50] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:16:25] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:27:27] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:32:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.97 seconds [03:35:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.94 seconds [04:44:37] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1381.90 Read Requests/Sec=403.60 Write Requests/Sec=1.60 KBytes Read/Sec=50345.60 KBytes_Written/Sec=22.00 [04:51:37] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=5.90 Read Requests/Sec=0.40 Write Requests/Sec=0.70 KBytes Read/Sec=2.00 KBytes_Written/Sec=6.80 [05:11:02] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3414481 (10MZMcBride) This task feels very "we should build a gun today and we... [05:31:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:31:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:36:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:39:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:13:55] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3414623 (10jcrespo) Probably related: T169884 [06:49:46] !log rebooting bast3002 for kernel update [06:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:07] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [07:09:10] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414712 (10MoritzMuehlenhoff) [07:11:57] ACKNOWLEDGEMENT - Host bast3002 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T169959 [07:18:27] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:33:28] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 [07:34:49] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:35:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 [07:36:17] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:36:26] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:37:03] !log marostegui@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [07:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:13] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 [07:39:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2056 - T169510 (duration: 00m 43s) [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [07:39:12] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414712 (10Volans) @MoritzMuehlenhoff the broken disk was known: T169959 IIRC something similar already happened (cannot remember if for this very host or lvs3001) and Faidon was able to make it boot ag... [07:40:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:13] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3414762 (10jcrespo) > I also wonder why some of those log warnings come from close() and others have the proper commitM... 
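The exim queue-size check iterated on above (r/361023; "dropped wc entirely, we can use exipick -bpc to count") could look roughly like the sketch below. This is a minimal illustration, not the actual patch: the thresholds, variable names, and output format are invented, and only exipick -bpc comes from the review comments. It also shows why the follow-up comment is right that pipefail buys nothing once the pipe is gone.

    #!/bin/bash
    # Minimal sketch of an NRPE-style exim queue-size check (assumes exipick
    # from exim4-base is installed; thresholds are illustrative).
    set -eu   # no pipeline remains, so "set -o pipefail" would be a no-op
    warn=1000
    crit=3000
    queue=$(exipick -bpc)   # prints a bare count of queued messages
    if [ "$queue" -ge "$crit" ]; then
        echo "CRITICAL: exim queue size is $queue"; exit 2
    elif [ "$queue" -ge "$warn" ]; then
        echo "WARNING: exim queue size is $queue"; exit 1
    fi
    echo "OK: exim queue size is $queue"; exit 0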
[07:41:37] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [07:41:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:57] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [600.0] [07:42:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 - T166204 (duration: 00m 42s) [07:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:00] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [07:45:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [07:53:38] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3414783 (10akosiaris) > Do we still need this Ganglia plugin or should we simply remove it since Ganglia is deprecated? We should remove it. > Do you know if it has worked before and someho... [07:55:53] !log installing libgcrypt security updates [07:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:22] anybody checking the mw exceptions? [07:56:26] just saw the alert [07:56:29] elukey: looking [07:57:01] thanks dcausse ! [08:00:26] dcausse: let me know if you need any help, seems ES related from the stacktrace but I might be wrong [08:00:40] seems related to cirrus, at least I see tons of errors since yesterday 9pm utc [08:05:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [08:11:07] elukey: huge load spike on many nodes, very similar to what we've seen earlier this week (see T169498) [08:11:08] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [08:11:23] it seems to be recovering now [08:11:57] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [08:12:27] okok [08:12:35] Cc: gehel [08:14:39] (03PS1) 10Alexandros Kosiaris: Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) [08:14:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:14:57] PROBLEM - Check systemd state on bast3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:16:34] jynus / marostegui can we do T167031 ? 
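The repools above follow the usual mediawiki-config deployment pattern: revert the depool commit, merge it, then push just that file from the deployment host. A rough sketch, assuming the standard scap sync-file invocation; the file path and log message are taken from the entries above, the rest is inferred:

    # On the deployment host (tin), once the revert is merged:
    cd /srv/mediawiki-staging
    git pull    # pick up the merged mediawiki-config change
    scap sync-file wmf-config/db-eqiad.php 'Repool db1083 - T166204'
    # If the canary error-rate check trips spuriously, as at 07:37:03 above,
    # the sync can be rerun with --force, per the error message itself.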
[08:16:34] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [08:16:44] TabbyCat: give me a sec [08:17:01] claro [08:20:11] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3414823 (10MarcoAurelio) 05stalled>03Open a:03MarcoAurelio [08:20:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:21:00] TabbyCat: can you send me the meta link for the progress? so I can keep it open? [08:21:07] sure [08:21:17] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:20] but I have to trigger the rename first marostegui [08:22:00] when you give me the okay we'll start [08:22:05] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414827 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Host is back with "disable system services" set to YES in the idrac configuration, https://wikitech.wikimedia.org/wiki/Platform-specific_docu... [08:22:07] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 79923 bytes in 0.289 second response time [08:22:10] TabbyCat: I just want to double check if most of the wikis are on kowiki :) [08:22:16] as specified on the task [08:23:11] you mean edits? [08:23:17] !log banning elastic1020 and elastic1026 from elasticsearch eqiad cluster [08:23:19] sorry yes [08:23:22] (03PS1) 10Alexandros Kosiaris: Add role::ores::stresstest hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/363782 [08:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:10] marostegui: https://meta.wikimedia.org/w/index.php?title=Special:CentralAuth&target=Idh0854 [08:24:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add role::ores::stresstest hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/363782 (owner: 10Alexandros Kosiaris) [08:25:05] TabbyCat: gracias, that is what I wanted. You can go ahead if you like! [08:25:08] I am ready [08:25:29] okay, give me a sec [08:25:44] TabbyCat: would you !log that or you want me to? [08:26:06] (03PS2) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [08:26:08] I can do that when I start :) [08:26:18] perfecto! :) [08:26:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:27:57] !log Starting global rename of Idh0854 → Garam (T167031) [08:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:07] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [08:29:31] TabbyCat: Send me the meta url once you've got it, so I can check the progress too and check the different shards :-) [08:29:37] marostegui: I got to usurp Garam first if the target consented [08:29:44] gimme a min [08:29:47] sure [08:31:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:32:14] marostegui: in progress now https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Garam [08:32:31] !log installing expat security updates [08:32:41] TabbyCat: gracias!
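The "banning" of elastic1020 and elastic1026 just logged is, in standard Elasticsearch terms, shard-allocation filtering. The exact tooling used isn't shown in the log, so the following is a generic sketch against the cluster settings API; the node names come from the log, the host and everything else are assumptions:

    # Exclude the two nodes so shards drain off them:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1020,elastic1026"
      }
    }'
    # The later unbanning (09:42:58 below) amounts to clearing the exclusion:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.exclude._name": "" }
    }'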
[08:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:37] Sorry to disturb, but I have a small request for assistance. A user is having difficulty logging in over at #wikimedia-tech and has tried two different devices on different networks, incognito mode, and different browsers, and isn't globally locked or IP blocked. [08:35:39] (03PS2) 10Giuseppe Lavagetto: role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 [08:39:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:39:57] <_joe_> !log disabling puppet across the fleet for enabling directory environments in puppet [08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:02] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 (owner: 10Giuseppe Lavagetto) [08:43:16] marostegui TabbyCat: ^ (if you're busy this second, very happy to hang for a bit) [08:43:51] marostegui: you decide, I already did my part :) [08:44:27] TabbyCat: I still see lots of wikis queued [08:44:27] (03PS1) 10Giuseppe Lavagetto: puppetmaster::gitclone: link environments to /etc/puppet [puppet] - 10https://gerrit.wikimedia.org/r/363784 [08:44:45] marostegui: I mean TheDragonFire request above [08:44:56] I cannot decide on what he said [08:45:18] oh wait [08:45:26] But, how is that blocking us? [08:45:31] I am a bit lost :) [08:45:36] so I was [08:45:38] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::gitclone: link environments to /etc/puppet [puppet] - 10https://gerrit.wikimedia.org/r/363784 (owner: 10Giuseppe Lavagetto) [08:45:47] I thought it was that role::puppetmaster [08:45:51] lol [08:45:58] I need another cup of coffee [08:46:27] TheDragonFire: I'd say to ask on #wikimedia-ops [08:47:05] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3414879 (10fgiunchedi) 05Open>03Resolved Looks like this is fixed, we don't have poolcounter in beta I think? Anyways if we do we can... [08:48:25] jfsamper: #wikimedia-ops is for IRC ops is it not? [08:48:33] TabbyCat* [08:48:48] yes [08:49:01] maybe he's triggering the login limit the channel has set [08:49:10] an operator can /invite him/her there [08:50:29] TabbyCat: The user is having problems logging onto Wikipedia, not IRC. But someone's just suggested trying login.wikimedia.org so we'll see how that goes. [08:50:39] ah ah [08:50:44] hmm [08:50:51] right [08:51:02] which error message does he receive? [08:51:26] I'm on -tech, will follow-up there [08:52:30] <_joe_> !log restarting apache on all puppetmaster, after a successful puppet run [08:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:04] <_joe_> !log reenabling puppet across the fleet [08:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:48] !log restarting HHVM on app server canaries to pick up libgcrypt and expat updates [08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:37] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
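For context on the directory-environments rollout above (disable puppet fleet-wide, merge r/362985 and r/363784, restart the puppetmasters, re-enable): once the master serves environments from a directory tree, an agent can be pointed at a non-default environment per run. A sketch with assumed paths and names, not the exact Wikimedia config; the "future" environment for the future parser shows up later in the log (09:42:09):

    # Master side, puppet.conf gains roughly:
    #   [main]
    #   environmentpath = /etc/puppet/environments
    # so that /etc/puppet/environments/production is the default tree and
    # e.g. /etc/puppet/environments/future is an alternate one.
    #
    # Agent side, a single test run against a non-default environment:
    puppet agent --test --environment=future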
[08:59:37] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3413499 (10Marostegui) Deleting it just from the database can create inconsistencies. I wouldn't feel too comfortable just issuing a drop database in producti... [09:01:37] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational [09:02:56] (03PS3) 10Filippo Giunchedi: Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) (owner: 10Ppchelko) [09:08:02] (03CR) 10Filippo Giunchedi: icinga/role:mail::mx: add monitoring of exim queue size (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [09:08:25] (03CR) 10Filippo Giunchedi: [C: 032] Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) (owner: 10Ppchelko) [09:11:32] (03PS1) 10Filippo Giunchedi: puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 [09:11:41] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3414940 (10akosiaris) [09:11:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:11:54] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10akosiaris) [09:13:32] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10akosiaris) 05Open>03Resolved Per @Ladsgroup 's comment we better handle the service implementation in T168073. Which is btw gonna be stalled as we are going to stress test a bit th... [09:15:58] (03PS2) 10Alexandros Kosiaris: Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) [09:16:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) (owner: 10Alexandros Kosiaris) [09:16:20] _joe_: https://gerrit.wikimedia.org/r/363789 [09:16:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [09:18:32] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 (owner: 10Filippo Giunchedi) [09:22:17] PROBLEM - DPKG on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:17] PROBLEM - puppet last run on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:27] PROBLEM - Check systemd state on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:27] those are ok ^ [09:22:48] PROBLEM - DPKG on ores1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:57] PROBLEM - Disk space on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:57] PROBLEM - MD RAID on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:59] TabbyCat: almost there [09:23:07] PROBLEM - DPKG on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:23:07] PROBLEM - dhclient process on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:07] PROBLEM - salt-minion processes on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] PROBLEM - configured eth on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational [09:23:17] PROBLEM - configured eth on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [09:23:17] RECOVERY - DPKG on ores1003 is OK: All packages OK [09:23:17] PROBLEM - puppet last run on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:24] marostegui: yep, after kowiki was done it started to go faster [09:23:31] !log schedule a month's worth of downtime for ores100X [09:23:32] I looked 10 minutes ago [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:47] RECOVERY - Disk space on ores1003 is OK: DISK OK [09:23:48] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:23:57] RECOVERY - DPKG on ores1001 is OK: All packages OK [09:23:57] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient [09:23:57] RECOVERY - salt-minion processes on ores1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:24:03] (03PS2) 10Filippo Giunchedi: puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 [09:24:07] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up [09:24:07] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up [09:24:07] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [09:24:11] we're on "u" [09:24:47] RECOVERY - DPKG on ores1004 is OK: All packages OK [09:24:59] !log installing NTP security updates on trusty hosts [09:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:03] TabbyCat: finished! [09:29:26] marostegui: good [09:30:14] !log Global rename of Idh0854 → Garam has finished (T167031) [09:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [09:31:39] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3414996 (10MarcoAurelio) 05Open>03Resolved Thanks to @marostegui for his help. 
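Scheduling "a month's worth of downtime for ores100X" (09:23:31 above) is typically done through Icinga's external command file rather than by clicking through the UI. A sketch; the command-file path, author, comment, and host list are assumptions:

    now=$(date +%s)
    end=$((now + 30*24*3600))   # one month, matching the !log entry
    for host in ores1001 ores1002 ores1003 ores1004; do
      # [time] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
      printf '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;0;akosiaris;ores stress test\n' \
        "$now" "$host" "$now" "$end" >> /var/lib/icinga/rw/icinga.cmd
    done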
[09:37:17] !log restarting elastic1036 (corrupted statistics) [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:11] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.39% of data above the critical threshold [1000.0] [09:42:09] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add environment for the future parser [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) [09:42:14] that's ores btw ^ the too many creates [09:42:58] !log unbanning elastic1020 and 1026 from elasticsearch eqiad [09:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:53] godog: yeah bringing the stresstesting cluster online [09:43:58] I am done actually [09:45:13] akosiaris: *nod* I'm opening a task to regularly purge ores metrics, almost half is a month old [09:45:29] (03PS1) 10Elukey: redis::monitoring::nrpe_instance: set retry_interval to 60s [puppet] - 10https://gerrit.wikimedia.org/r/363791 [09:45:35] (03CR) 10Alexandros Kosiaris: [C: 031] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [09:46:31] 10Operations, 10Graphite, 10ORES, 10Scoring-platform-team-Backlog, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10fgiunchedi) [09:47:39] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add environment for the future parser [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [09:50:21] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:50:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:53:21] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:54:21] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:54:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] redis::monitoring::nrpe_instance: set retry_interval to 60s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363791 (owner: 10Elukey) [09:59:21] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:40] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415085 (10jcrespo) Someone announced 60 seconds of downtime, which I do not think is reasonable- rebooting fully a server and all its services takes around 3... [10:00:21] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:02:02] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [10:08:51] PROBLEM - NTP on db1069 is CRITICAL: NTP CRITICAL: Offset unknown [10:09:10] (03CR) 10Gehel: [C: 031] "LGTM (and so much nicer to be able to reuse this QueryBuilder)." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/363750 (owner: 10Volans) [10:09:11] PROBLEM - NTP on db1026 is CRITICAL: NTP CRITICAL: Offset unknown [10:09:17] 10Operations, 10Graphite, 10User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3415090 (10fgiunchedi) [10:12:01] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:12:11] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:13:31] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:13:51] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [10:14:01] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:14:21] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [10:15:49] (03PS2) 10Elukey: redis::monitoring::nrpe_instance: set retry_interval to 2 mins [puppet] - 10https://gerrit.wikimedia.org/r/363791 [10:16:27] <_joe_> uh what's up on thumbor? [10:24:18] (03CR) 10Gehel: Configuration: automatically load backend's aliases (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [10:25:25] godog: I purged it a few times already [10:25:52] 10Operations, 10Graphite, 10User-fgiunchedi: Something puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3415115 (10fgiunchedi) [10:27:41] i.e. action=purge on the main web page, then trying to bypass varnish by adding http parameters and X-wikimedia-debug [10:29:18] also can you check https://phabricator.wikimedia.org/T168002#3377446? the file is deleted ages ago [10:30:48] !log restarting elastic1043 (corrupted statistics) [10:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:11] zhuyifei1999_: not today sorry, they'll eventually fall out of varnish though if they are gone from swift [10:36:30] um okay [10:36:36] can I have an eta? 
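The ETA question gets answered just below (cache TTL of roughly 7 days). In the meantime, how long a given object has been sitting in cache can be read straight from the response headers; a sketch, with an illustrative URL:

    curl -sI 'https://upload.wikimedia.org/wikipedia/commons/x/xx/Example.webm' \
      | grep -iE '^(HTTP|Age|X-Cache)'
    # Age counts seconds spent in cache; once it exceeds the object's TTL
    # (~7 days here) the cached copy expires and a file deleted from swift
    # stops being served.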
[10:38:51] RECOVERY - NTP on db1069 is OK: NTP OK: Offset -5.36441803e-05 secs [10:39:11] RECOVERY - NTP on db1026 is OK: NTP OK: Offset -9.620189667e-05 secs [10:45:13] (03PS1) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:46:33] (03CR) 10jerkins-bot: [V: 04-1] profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [10:48:42] (03PS2) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:53:13] (03PS3) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:55:59] zhuyifei1999_: TTL of varnish is ~7d IIRC [10:56:53] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [10:57:03] https://commons.wikimedia.org/wiki/File:Dfsdfsdfsdfsdf.webm is deleted on June 17 [11:00:16] zhuyifei1999_: ok that file was still in swift, I've deleted it [11:00:24] k [11:00:28] thx [11:01:13] zhuyifei1999_: looks like a bug on mw side though, I suggest you add some mediawiki projects too [11:19:54] (03PS2) 10Muehlenhoff: Remove expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/363535 [11:23:45] (03CR) 10Muehlenhoff: [C: 032] Remove expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/363535 (owner: 10Muehlenhoff) [11:28:51] akosiaris: I'd try https://gerrit.wikimedia.org/r/#/c/363791, seems safe enough [11:40:56] !log rebooting rdb* servers in codfw for kernel update [11:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:20:11] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:12] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:12] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:14] !log restart mysql on dbstore1002 - high swap used [12:20:21] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:22] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:22] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:36] aaand it wasn't downtimed anymore [12:20:41] good job Luca [12:20:41] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is 
CRITICAL: CRITICAL slave_sql_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:42] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:42] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:43] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:51] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:52] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:22:21] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:22:51] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:23:51] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [12:24:22] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [12:27:11] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:11] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:12] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:21] sorry for the noise [12:27:21] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:21] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:22] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:42] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:42] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:43] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:51] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:51] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:42:42] (03PS3) 10Giuseppe Lavagetto: Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 [12:42:44] (03PS4) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [12:42:46] (03PS3) 
10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 [12:42:48] (03PS3) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 [12:42:50] (03PS1) 10Giuseppe Lavagetto: [WiP] Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 [12:43:33] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (owner: 10Giuseppe Lavagetto) [12:46:46] Hello guys how's everybody [12:47:05] I come from an ISP, currently unable to reach wikipedia.org [12:47:22] Some users are complaining about this issue. Is there anyone around I can check the problem with? THX [12:47:48] XioNoX: --^ [12:47:58] imadoz: Hello again - are you able to perform a traceroute from the network? [12:48:40] Sure [12:49:03] it would also be good to have an address to traceroute back from our network, we could open a phabricator task and set it confidential [12:49:03] I have actually configured a forwarder for all DNS queries destined for wikipedia.org [12:50:08] imadoz: is wikipedia.org resolving at all? [12:50:32] Of course, should I paste the trace here? [12:50:52] imadoz: a pastebin would be preferred [12:51:43] What's the ISP/country? [12:52:17] XioNoX: Lebanon, not sure on ISP [12:52:43] yeah, we got a report yesterday [12:52:58] Oh, routing? :/ [12:53:21] mnets.net? [12:53:50] imadoz: ^? [12:54:19] so far it seems like there is a middleman blocking traffic to Wikipedia's European datacenter from that provider [12:54:45] dropping http sessions, and not letting https establish [12:54:50] https://pastebin.com/embed_js/eZL8302R [12:55:18] imadoz: which ISP are you from? [12:55:38] ISP is Broadband Plus from Lebanon [12:55:49] You can use the IP address 62.84.80.202 to trace back to us [12:55:58] It is my current NAT IP [12:57:44] imadoz: https://www.ripe.net/membership/indices/data/lb.broadbandplus.html ? Should cedarcom.net serve anything? I get redirected to http://www.mobi.net.lb but it times out [12:58:04] Broadband Plus /Cedarcom /Mobi same company [12:59:01] We are doing some maintenance on our website, and it is currently down [12:59:16] It will be back up in minutes, but this is not related [13:00:11] <_joe_> imadoz: do you know if other providers in Lebanon are having the same issue? [13:00:23] Did not check with other providers [13:00:45] We have a DDOS mitigation service running 24/7 as well, therefore all our traffic is tunneled towards another provider in RUSSIA [13:01:26] I am currently in the process of advertising one /24 subnet directly to our TELCO and checking the issue if it gets resolved [13:02:44] let's start with the output of curl -v https://en.wikipedia.org/wiki/Main_Page and "telnet 91.198.174.192 443" [13:02:45] Then: can you edit /etc/hosts and add the line "208.80.153.224 en.wikipedia.org" (without the quotes) and then wait a few minutes and share "curl -v https://en.wikipedia.org/wiki/Main_Page" [13:02:48] imadoz: ^ [13:03:12] to make sure it's the same symptoms [13:05:18] (03PS1) 10Marostegui: db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) [13:10:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "fwiw, we do use the slaveof command extensively when doing switchovers."
[puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [13:13:56] imadoz: please do let us know if there's anything that doesn't quite make sense :) it appeared from the above pastebin you may be using a Windows system, and might not have access to the curl command. If you have access to powershell curl is an alias for a similar command and the above syntax will work [13:15:24] Sorry for the delay guys [13:16:24] Not a problem :-) [13:16:32] We are checking with our DDOS mitigation provider if the problem is due to a middleman blocking traffic like you mentioned [13:17:12] While the subnet was advertised towards our TELCO directly, the website worked fine. Therefore, I believe no problem exists from our side or yours [13:17:15] I will keep you posted [13:18:10] Good to hear! [13:27:27] (03PS1) 10Giuseppe Lavagetto: Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 [13:27:53] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 (owner: 10Giuseppe Lavagetto) [13:28:31] (03Merged) 10jenkins-bot: Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 (owner: 10Giuseppe Lavagetto) [13:29:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 (owner: 10Giuseppe Lavagetto) [13:30:37] (03PS1) 10Marostegui: s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) [13:31:11] (03PS2) 10Jcrespo: Revert "install_server: Change db1098 MAC address to the one that shows link" [puppet] - 10https://gerrit.wikimedia.org/r/363565 [13:31:24] (03PS5) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [13:31:48] (03CR) 10Alexandros Kosiaris: [C: 032] redis::monitoring::nrpe_instance: set retry_interval to 2 mins [puppet] - 10https://gerrit.wikimedia.org/r/363791 (owner: 10Elukey) [13:33:25] (03CR) 10Marostegui: [C: 032] s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:34:12] (03Merged) 10jenkins-bot: s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:45:30] (03CR) 10Jcrespo: [C: 032] Revert "install_server: Change db1098 MAC address to the one that shows link" [puppet] - 10https://gerrit.wikimedia.org/r/363565 (owner: 10Jcrespo) [14:04:51] RECOVERY - Check systemd state on bast3002 is OK: OK - running: The system is fully operational [14:05:49] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3415559 (10herron) 05Open>03Resolved Confirming that the updated logrotate config works. Files have been rotated to .11.gz. -rw-r----- 1 Debian-exim adm 22M Jun 27 06:25 /var/... 
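For reference, XioNoX's debugging steps from 13:02 above, collected into one runnable sequence. Per the log, 91.198.174.192 is the European (esams) frontend the ISP normally reaches and 208.80.153.224 is an alternative datacenter address used to compare paths; editing /etc/hosts needs root, and the extra line should be removed again afterwards:

    # 1) raw TCP reachability to the esams frontend:
    telnet 91.198.174.192 443
    # 2) full TLS handshake plus page fetch, verbose:
    curl -v https://en.wikipedia.org/wiki/Main_Page
    # 3) pin en.wikipedia.org to the other address and repeat, to see
    #    whether the failure follows the esams path:
    echo '208.80.153.224 en.wikipedia.org' | sudo tee -a /etc/hosts
    curl -v https://en.wikipedia.org/wiki/Main_Page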
[14:17:52] (03CR) 10Jcrespo: [C: 032] Update to mariadb 10.1.25, support multi-instance, move unit path [software] - 10https://gerrit.wikimedia.org/r/363327 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:18:20] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3415609 (10elukey) A lot of things changed from my last post, most of them due to the fact that now the apps are not sending any... [14:18:38] (03PS7) 10Jcrespo: mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:22:02] @TheresNoTime @XioNoX @_joe_ Thanks guys, problem was resolved successfully! [14:22:22] No worries, thanks for dropping by :-) [14:22:29] @elukey as well [14:22:32] C you guys [14:22:55] thank you! [14:23:23] (03PS8) 10Jcrespo: mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:34:32] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3415654 (10Papaul) [14:35:40] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete , You can take over. Thanks. [14:37:57] (03CR) 10Jcrespo: [C: 032] mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:39:28] (03CR) 10Jcrespo: [C: 04-1] "Let's test on another host first." [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:41:17] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:03] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [14:42:42] PROBLEM - mysqld processes on dbstore2002 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [14:42:55] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:56] icinga lost downtimes again [14:42:56] :( [14:43:00] :-( [14:43:15] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:43:27] icinga? 
[14:43:31] ah, as expected [14:43:38] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:43:38] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:43:38] PROBLEM - nutcracker process on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:43:45] PROBLEM - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:43:48] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [14:43:58] PROBLEM - puppet last run on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:43:58] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR [14:44:07] is the puppet part me? [14:44:18] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:18] PROBLEM - salt-minion processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:18] PROBLEM - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:44:20] (03CR) 10Alexandros Kosiaris: [C: 031] Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 (owner: 10Giuseppe Lavagetto) [14:44:35] PROBLEM - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:48] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:52] yes, the puppet part is me [14:44:55] PROBLEM - salt-minion processes on ms-fe3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:44:56] PROBLEM - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused [14:45:03] but only on non-jessie hosts, maybe? [14:45:08] PROBLEM - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:15] PROBLEM - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused [14:45:28] PROBLEM - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:35] PROBLEM - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused [14:45:37] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:45:45] PROBLEM - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:46] PROBLEM - HHVM processes on mw2148 is CRITICAL: NRPE: Command check_hhvm not defined [14:45:47] PROBLEM - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:45:47] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:55] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:56] PROBLEM - HHVM rendering on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused [14:45:57] PROBLEM - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:57] PROBLEM - salt-minion processes on ms-fe3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:46:05] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:05] PROBLEM - cassandra-a service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:46:05] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:09] gah, I'll fixup ms-fe3 [14:46:15] (03CR) 10Alexandros Kosiaris: [C: 031] Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 (owner: 10Giuseppe Lavagetto) [14:46:16] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:17] PROBLEM - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [14:46:18] (03PS3) 10Filippo Giunchedi: Recommendation API: Add the beta scap source [puppet] - 10https://gerrit.wikimedia.org/r/360686 (https://phabricator.wikimedia.org/T165760) (owner: 10Mobrovac) [14:46:19] PROBLEM - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.118 and port 9042: Connection refused [14:46:35] PROBLEM - Nginx local proxy to apache on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 443: Connection refused [14:46:35] PROBLEM - Check systemd state on ms-fe3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:35] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:45] PROBLEM - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:46:46] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused [14:46:55] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:55] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[apache2] [14:46:55] PROBLEM - cassandra-b service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:05] PROBLEM - puppetmaster https on labtestpuppetmaster2001 is CRITICAL: connect to address 208.80.153.108 and port 8140: Connection refused [14:47:15] PROBLEM - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:15] ACKNOWLEDGEMENT - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:16] ACKNOWLEDGEMENT - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:16] ACKNOWLEDGEMENT - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:17] ACKNOWLEDGEMENT - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff T168613 [14:47:17] ACKNOWLEDGEMENT - IPMI Temperature on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:18] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused Muehlenhoff T168613 [14:47:35] PROBLEM - Check systemd state on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:35] PROBLEM - dhclient process on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:45] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:47:45] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:47:45] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 [14:47:46] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169993 [14:47:47] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:47:47] PROBLEM - 
mediawiki-installation DSH group on mw2148 is CRITICAL: Host mw2148 is not in mediawiki-installation dsh group [14:47:47] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:47] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:50] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T169993#3415683 (10ops-monitoring-bot) [14:48:05] PROBLEM - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:48:05] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:48:05] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:06] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:15] PROBLEM - HTTPS-eventdonations on eventdonations.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Name or service not known [14:48:15] PROBLEM - Apache HTTP on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused [14:48:15] PROBLEM - kartotherian endpoints health on maps-test2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:48:15] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:48:16] PROBLEM - nutcracker port on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:48:16] PROBLEM - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:16] PROBLEM - salt-minion processes on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:17] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [14:48:35] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:48:53] ACKNOWLEDGEMENT - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check 
systemd state on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - IPMI Temperature on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:54] ACKNOWLEDGEMENT - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:54] ACKNOWLEDGEMENT - MegaRAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:55] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused Muehlenhoff T169696 [14:48:55] ACKNOWLEDGEMENT - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused Muehlenhoff T169696 [14:48:56] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused Muehlenhoff T169696 [14:49:25] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:50:05] RECOVERY - salt-minion processes on ms-fe3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:50:05] RECOVERY - salt-minion processes on ms-fe3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:50:14] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3415687 (10Dzahn) a:03Dzahn thanks @akosiaris gotcha! [14:50:16] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:50:35] RECOVERY - Check systemd state on ms-fe3002 is OK: OK - running: The system is fully operational [14:51:05] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:51:35] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:51:47] (03CR) 10Filippo Giunchedi: [C: 032] Recommendation API: Add the beta scap source [puppet] - 10https://gerrit.wikimedia.org/r/360686 (https://phabricator.wikimedia.org/T165760) (owner: 10Mobrovac) [14:52:25] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl... [14:53:14] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl... 
[14:53:15] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:53:31] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Ple... [14:53:38] How is that page possible, the check is silenced [14:54:00] Ah, right, it is an old one [14:55:18] downtimes do not avoid pages [14:55:29] I do not know why people continue thinking they do [14:58:05] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:25] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:14] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3387489 (10hashar) >>! In T169114#3414879, @fgiunchedi wrote: > Looks like this is fixed, we don't have poolcounter in beta I think? Anywa... [14:59:15] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:55] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:35] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:35] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:36] !log deleting commonswiki_file_1499379383 on elastic@eqiad (failed reindex) [15:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:25] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:35] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:35] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:55] (03PS2) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) [15:01:57] (03PS1) 10Jcrespo: Fix service for hosts with a default package (fwup. f13be9f5a2949f) [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:02:27] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:43] jynus: I was going to upgrade db1102 to 10.1, you want to take it for your tests? [15:02:46] It is not urgent [15:02:48] (03PS2) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:03:05] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:03:10] I was going to use another host [15:03:15] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:03:15] jynus: Ah cool then :) [15:03:28] are you going to upgrade it anyway? [15:03:39] yeah to make it like db1095 [15:03:51] oh [15:03:59] you mean mariadb [15:04:03] sorry, yes :) [15:04:07] I thought you meant the os [15:04:15] no no, jessie + 10.1 [15:04:27] do as it is best for you [15:04:34] no opinion there [15:04:35] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:04:41] cool, I will do it then if you are not going to use it for your tests [15:04:45] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:47] thank you! [15:05:46] (03PS3) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:06:35] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:08] !log Stop MySQL on db1102 for MariaDB upgrade [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:29] (03PS4) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:09:05] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:25] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:33] (03CR) 10Jcrespo: [C: 032] mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:09:35] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:45] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:05] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:25] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:12:04] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415739 (10Halfak) Announcements have been updated. Thanks for the note. Shall we always announce a 1 hour maintenance window for DB maintenance? [15:13:23] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3415741 (10hashar) a:05hashar>03None **Status update** There are a few patches for puppet.git that are... [15:15:16] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415746 (10jcrespo) It varies from maintanance to maintenance, depending on the work to be done. Some take more some take less- the "normally" was meant as "N... 
[15:16:07] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3415747 (10herron) Received an alert today via the email-to-SMS gateway. Is this the expected behavior, or should the alert have been sent directly via SMS? [15:16:26] 10Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3415748 (10herron) [15:17:03] (03PS1) 10Rush: labtest: new public servers add base + firewall [puppet] - 10https://gerrit.wikimedia.org/r/363843 (https://phabricator.wikimedia.org/T168893) [15:22:14] (03CR) 10Rush: [C: 032] labtest: new public servers add base + firewall [puppet] - 10https://gerrit.wikimedia.org/r/363843 (https://phabricator.wikimedia.org/T168893) (owner: 10Rush) [15:22:21] RECOVERY - mysqld processes on db1102 is OK: PROCS OK: 7 processes with command name mysqld [15:24:04] 7? [15:24:23] <_joe_> melius abundare quam deficere ("better to have too much than too little") [15:24:29] lol [15:26:00] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3413824 (10mobrovac) [15:26:35] PROBLEM - Check HHVM threads for leakage on mw2148 is CRITICAL: NRPE: Command check_check_leaked_hhvm_threads not defined [15:27:03] It reminds me of the famous quote: "Don't create too many checks on icinga, or you may regret it in the future" --abraham lincoln [15:27:15] jynus: abe said that? [15:27:20] jynus: what a smart man [15:30:35] "Nearly all men can stand adversity, but if you want to test a man's character, give him root access." [15:30:52] (03PS1) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:31:05] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:32:04] four score and seven alerts ago [15:32:22] (03PS3) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) [15:32:24] (03PS2) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:34:30] (03CR) 10Jcrespo: [C: 032] mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:34:38] (03PS3) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:35:42] 10Operations, 10Wikimedia-Stream: rcstream service - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#3415801 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information. [15:35:44] 10Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#3415805 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information.
[15:35:46] 10Operations, 10Wikimedia-Stream: Error on RCStream server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#3415809 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information. [15:39:47] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415818 (10Halfak) [15:40:55] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) [15:41:04] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) Gotcha. Next time, we should add these details to the task description and I'll pick them up from there when making announcement. :) In... [15:51:40] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:52:10] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:52:29] (03CR) 10Muehlenhoff: [C: 031] Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [15:56:00] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:56:40] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:57:41] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:58:30] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:59:10] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:59:11] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:00:00] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:00:10] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:00:20] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:01:00] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:01:30] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:01:40] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:03:20] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:05:00] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:07:30] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:07:40] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:08:20] RECOVERY - puppet last run on pc2006 is OK: OK: 
Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:08:30] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:08:31] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:08:40] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:10:30] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:13:50] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:14:50] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:15:30] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:17:51] (03PS1) 10Milimetric: Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) [16:32:39] (03Draft1) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [16:32:41] (03PS2) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [17:02:53] !bash < jynus> It reminds me of the famous quote: "Don't create too many checks on icinga, or you may regret it in the future" --abraham lincoln [17:02:53] bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AV0eARQxU4b8yJAIAfBE [17:03:51] lol [17:11:51] spoiler- the quote is actually fake [17:11:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3416029 (10faidon) FTR, as I mentioned on IRC, these three changes are continuing down the path of accumula... [17:17:35] (03CR) 10C. Scott Ananian: [C: 031] "I don't have C+2 rights to puppet. (I used to, though?)" [puppet] - 10https://gerrit.wikimedia.org/r/363045 (owner: 10Mobrovac) [17:24:49] (03CR) 10Mforns: [C: 031] "LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [17:25:08] (03CR) 10Jcrespo: "I think that should get it from being a dbstore- we have analytics nodes, dbstores, and dbstores that are also analytics nodes. 
I will che" [puppet] - 10https://gerrit.wikimedia.org/r/356648 (owner: 10Jcrespo) [17:25:13] (03PS1) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [17:38:10] (03PS4) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [17:43:38] (03PS10) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:44:06] (03PS11) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:46:31] (03CR) 10Paladox: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [17:53:00] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3416130 (10chasemp) a:05Cmjohnson>03chasemp I'll try to take care of this in the am mon or tue [18:13:20] (03PS1) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:14:19] needs a bot that creates phab tickets when a wiki page changes [18:14:32] but only that one section on the page :p [18:15:22] (03CR) 10Paladox: [C: 031] planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 (owner: 10Dzahn) [18:17:05] (03CR) 10Dzahn: [C: 032] planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 (owner: 10Dzahn) [18:17:51] (03PS2) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:18:07] (03PS1) 10BryanDavis: toolforge: Remove SCAN redis command [puppet] - 10https://gerrit.wikimedia.org/r/363858 (https://phabricator.wikimedia.org/T169957) [18:18:15] chasemp: ^ [18:18:17] (03PS3) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:19:08] (03CR) 10Rush: [V: 032 C: 032] toolforge: Remove SCAN redis command [puppet] - 10https://gerrit.wikimedia.org/r/363858 (https://phabricator.wikimedia.org/T169957) (owner: 10BryanDavis) [18:19:30] bd808: done [18:22:19] (03PS4) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:23:17] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3416411 (10eross) @Dzahn From the ticket on Zendesk it stated problemsdonate@ but I can change if it needs to be problemsdonating@ however, problemsdonating@ is already created as an u... [19:09:39] !summon wikibugs [19:14:08] mutante: looks like it needs a restart [19:14:13] [20:29:05] * wikibugs (tools.wiki@wikimedia/bot/pywikibugs) quit (Ping timeout: 260 seconds) [19:14:42] You know what would be cool? A bot restarting bot [19:14:54] Who restarts the bot restarting bot? [19:15:10] that is what i meant to imply :) [19:15:16] by making up that command, heh [19:15:28] Hehehe [19:16:50] maybe i will do it some day when i get to https://gerrit.wikimedia.org/r/#/c/320698/ :) [19:16:53] mutante: Where did we leave off on finishing moving releases.wm.o off bromine? 
[19:17:08] since eggdrop = rock stable, hehehe [19:17:30] RainbowSprinkles: last thing i did was make the rsync work and confirmed the files were there [19:17:39] we didn't switch it yet because of the upload part [19:17:41] That's what I thought [19:17:48] Yeah, that's the remaining bit, upload + dns [19:17:59] if it wasn't for the upload, i would change DNS now [19:18:12] but that part was always so tricky and i had like at least 2 long debug sessions with subbu [19:18:16] on the _existing_ setup :p [19:18:21] * RainbowSprinkles nods [20:17:29] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3416713 (10Krinkle) >>! In T102178#3402913, @Krinkle wrote: > @GWicke At which point was wikimedia.org (or www.wikimedia.org?) a wiki? Assumin... [20:17:43] 10Operations, 10Cloud-Services, 10RESTBase, 10Services, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1358396 (10Krinkle) >>! In T102178#3403100, @GWicke wrote: > @krinkle, your comment sounds like it might have been intended for {T133178}. Indee... [20:32:05] greg-g: is it OK if I backport the patch for https://phabricator.wikimedia.org/T169261 ? [20:33:17] (03CR) 10ZZhou (WMF): "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/363884 (owner: 10Dzahn) [20:39:37] hi, is there a dns problem? [20:39:46] pinging has started failing for labs instances [20:40:32] things are recovering now. [20:46:17] (03PS7) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [20:47:34] 10Operations, 10Services (done): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3416834 (10GWicke) See also https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Alerts_.28with_notifications_via_Icinga.29 for some documentation on the topic by @halfak and myself. [20:48:10] (03CR) 10Paladox: "> still per my comment on PS2, the pub part goes in the public repo," [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [20:48:32] (03CR) 10Paladox: "Following how it was done for phabricator." [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:02:47] (03CR) 10Dzahn: ">What do you mean that the private key goes into the secret?" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:02:58] (03PS8) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [21:03:43] (03CR) 10Dzahn: ">Following how it was done for phabricator." [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:03:47] (03CR) 10Paladox: "> >What do you mean that the private key goes into the secret?"
[labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:04:47] (03CR) 10Dzahn: ">storing both keys under secret" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:05:09] legoktm: sorry for the late response, but yes [21:11:54] 10Operations, 10MediaWiki-JobQueue, 10monitoring: Establish monitoring thresholds for job queue - https://phabricator.wikimedia.org/T79687#3416884 (10Krinkle) [21:11:57] 10Operations, 10MediaWiki-JobQueue, 10monitoring, 10Patch-For-Review: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#3416882 (10Krinkle) [21:12:02] Hello, Wikimedia AI team is wondering how we would send celery logs and events to logstash on prod and labs [21:12:18] 10Operations, 10MediaWiki-JobQueue, 10monitoring: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#2224324 (10Krinkle) [21:12:35] (03PS1) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [21:17:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:21:39] (03CR) 10BryanDavis: openstack: add wikitech-grep as utility for adminscripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:22:18] (03CR) 10Dzahn: [C: 031] use 'require_package' for stats packages including python-yaml [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [21:26:54] (03CR) 10Chad: [C: 04-1] "Forking mwgrep is a terrible idea." [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:27:47] (03PS2) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [21:28:09] (03CR) 10Chad: [C: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:28:13] (03Abandoned) 10Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - 10https://gerrit.wikimedia.org/r/362601 (owner: 10Paladox) [21:28:14] RainbowSprinkles: can you help me understand? Forking mwgrep is a terrible idea. [21:28:18] honestly I just don't know [21:28:29] Why wouldn't you just add like extra options to the original mwgrep? [21:28:54] (03CR) 10jerkins-bot: [V: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:29:08] jerkins-bot agrees, but probably for different reasons :p [21:29:10] I'm not sure why bryan did it this way, maybe to cut down on clutter for a specific use case [21:29:20] Silly. [21:29:37] bd808: do you care to pursue either wikitech-grep or modifying mwgrep?
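(The logstash question at 21:12:02 never got an answer in-channel. For the record, a minimal sketch of one common approach, assuming the third-party python-logstash package and a Logstash TCP input; the hostname, port, app name, and broker URL are placeholders, not the actual prod/labs endpoints.)

    import logging

    import logstash  # third-party package: python-logstash
    from celery import Celery
    from celery.signals import after_setup_logger, after_setup_task_logger

    app = Celery('scoring', broker='redis://localhost:6379/0')  # placeholder broker

    def attach_logstash_handler(logger, **kwargs):
        # Ship records to Logstash as v1 JSON events over TCP.
        handler = logstash.TCPLogstashHandler('logstash.example.org', 5959, version=1)
        handler.setLevel(logging.INFO)
        logger.addHandler(handler)

    # Celery fires these signals after it has configured its global and
    # per-task loggers, so handlers added here survive Celery's own setup.
    after_setup_logger.connect(attach_logstash_handler)
    after_setup_task_logger.connect(attach_logstash_handler)

(This covers log lines only; Celery's task events are a separate stream and would need their own consumer.)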
[21:29:39] Like, if you want to limit mwgrep to one (or more) wikis, just add like a --wiki parameter [21:29:58] I mean, it's a utility that no one is asking you to use and I don't feel that strongly about it honestly [21:30:02] but we are going to keep using it I think [21:30:08] If you want to use different namespaces, expand the ability from --user and such to be more namespace-agnostic [21:30:09] it just won't be in puppet [21:30:44] sigh [21:30:51] I mean if you want to use unpuppetized stuff that's on you, I just think forking within puppet is a bad idea. [21:30:53] my point is the outcome of an in-use thing being in puppet and findable so that could happen is better than the current situation [21:30:56] this is why it was just in ~bd808 [21:31:08] ok fair, I'll abandon and keep it on the sly [21:31:15] making mwgrep better would be great [21:31:21] it seems like a silly argument [21:31:47] getting rid of mwgrep and having a cirrussearch replica cluster would be better [21:32:26] Talk to discovery and cloud services about that. [21:32:33] arguing about a 100 line python wrapper around the elasticsearch api is not worth anyone's time [21:32:34] (reposting question as it probably got lost in scrollback) Does anyone know how Wiki-ai can log celery logs and events to logstash? [21:32:51] I haven't used mwgrep so I'm not sure about it [21:32:53] RainbowSprinkles: it was in my hardware ask. it just didn't make the final cut [21:33:01] bummer :( [21:33:11] we will get it done [21:33:16] RainbowSprinkles: is this a thing you feel like standing on? the -1 [21:33:36] I've lived a productive life, I guess I found my hill to die on? [21:33:41] jk. Tempest/teapot [21:33:41] you have thought about it more than I, I haven't used mwgrep in all honesty only this shell of a util [21:33:44] I just think it's dumb af [21:33:53] When you could add like a --wiki parameter to mwgrep [21:33:57] that's fair, then make the changes you are requesting? [21:34:24] I guess...it's cool you object honestly but why you care if we use this I don't understand [21:34:47] it has no impact on anything or anyone [21:35:10] Because the existing tool works too :) [21:35:15] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416966 (10Zppix) IIRC ops have to run a script to update the interwiki map, therefore adding the tag [21:35:30] can I use it to search wikitech for instances of the string 'bigbrother'? [21:35:37] honest question as I don't know [21:35:40] and that's my most recent use case [21:35:49] no, it's just for looking for js stuff [21:35:56] It can look at other things [21:36:03] --module searches NS_MODULE [21:36:13] But the code is versatile enough that a small tweak could make it general-purpose namespace [21:36:25] sure but no one is going to do it or wants to [21:36:56] FWIW legoktm is chastising me in other places for making a cli tool at all [21:37:23] it seems odd to me people care, what difference does it make to anyone else? [21:38:47] jenkins does hate it as well [21:40:24] * RainbowSprinkles removes his -1, considers his objection heard [21:41:07] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3415305 (10Dzahn) Do you mean running a script on a maintenance server? There are > 55 deployers who can do that too (and often do), not just the few ops.
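(For readers outside the channel: the tool being argued over is on the order of the sketch below. This is not mwgrep's or wikitech-grep's actual source, just an illustration of the "--wiki parameter" tweak suggested at 21:29:39; the index naming, field names, and server URL are assumptions, and the query call is the pre-8.x elasticsearch-py style.)

    import argparse

    from elasticsearch import Elasticsearch  # third-party package: elasticsearch

    def main():
        parser = argparse.ArgumentParser(description='grep page text via the search indices')
        parser.add_argument('pattern', help='phrase to look for')
        parser.add_argument('--wiki', help="limit the search to one wiki's index, e.g. labswiki")
        parser.add_argument('--ns', type=int, default=8, help='namespace id (8 = MediaWiki)')
        args = parser.parse_args()

        es = Elasticsearch(['http://search.example.org:9200'])  # placeholder URL
        index = '{}_content'.format(args.wiki) if args.wiki else '_all'
        body = {
            'query': {'bool': {
                'filter': [{'term': {'namespace': args.ns}}],
                'must': [{'match_phrase': {'source_text': args.pattern}}],
            }},
            '_source': ['title'],
            'size': 100,
        }
        for hit in es.search(index=index, body=body)['hits']['hits']:
            print(hit['_source']['title'])

    if __name__ == '__main__':
        main()

(With flags like these, the 21:35:30 wikitech/'bigbrother' use case becomes a command-line option rather than a second, forked script.)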
[21:42:00] RainbowSprinkles: I think I don't know enough history about mwgrep to know if this is a violation of standing arguments [21:42:40] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3416972 (10Dzahn) @eross Thank you Emerauld! Appreciate it. I will remove it on our side and close this ticket here. Yes, the rest can just be up to James A and Fundraising. [21:42:53] mutante: I typed that on mobile (my desktop won't connect due to internet on my end so that's why i summed it up) [21:43:49] chasemp: It's not about the history of mwgrep. [21:43:58] It's just that I think forking code is usually lame [21:44:36] considering I haven't used mwgrep I'm behind the curve [21:45:23] forking is almost never the answer imho. It's one of the things I dislike about GitHub [21:45:47] * RainbowSprinkles steps out for some air [21:46:04] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416974 (10Zppix) >>! In T169979#3416969, @Dzahn wrote: > Do you mean running a script on a maintenance server? If that is possible to do sometime, yes if i need to be there let me know and i will sho... [21:47:57] (03PS1) 10Chad: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) [21:48:07] (03CR) 10Chad: [C: 032] Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:49:42] (03Merged) 10jenkins-bot: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:49:52] (03CR) 10jenkins-bot: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:50:14] (03CR) 10Hashar: [C: 031] "Krinkle: correct. Though some jobs invoke SiteConfiguration::getConfig() which ends up shelling out :\ So indirectly jobs do rely on mwsc" [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [21:52:39] !log demon@tin Synchronized wmf-config/interwiki.php: Updating interwiki cache, T169979 (duration: 00m 43s) [21:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:53] T169979: Update to interwiki map - https://phabricator.wikimedia.org/T169979 [21:53:03] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416997 (10Zppix) a:03demon [21:53:37] Zppix: Please don't assign tasks to me next [21:53:39] time [21:53:51] k [21:54:44] !log legoktm@tin Synchronized php-1.30.0-wmf.7/extensions/CentralAuth/: Fix handling of password hash upgrade on login - T169261 (duration: 00m 45s) [21:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:55] T169261: Users unable to remain logged in, associated with attempts to upgrade the password hash on every login - https://phabricator.wikimedia.org/T169261 [22:26:55] 10Operations, 10MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#3417080 (10Krinkle) [22:36:34] 10Operations, 10Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#3417115 (10Dzahn) [22:36:36] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3417112 (10Dzahn) 05Open>03Resolved a:03Dzahn Removed on ops side. I see that problems.donating , problemdonating and problem.donating and comentarios work in Google. The other... [22:40:21] (03PS1) 10Chad: WIP: Simple wrapper around updating the interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363970 [23:39:42] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3417355 (10greg) >>! In T144006#2689359, @hashar wrote: > What is left is deployment-tmh01 which needs some packaging work for Jessie as I understood it. That was Oct 2016 :... [23:40:10] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3417359 (10greg) [23:40:12] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639607 (10greg)
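(A loose end from 22:40:21: the "simple wrapper" change is only named above, so the sketch below is a guess at the two-step workflow it would automate, inferred from the 21:52:39 !log line: regenerate wmf-config/interwiki.php from a maintenance script, then sync the file. The script path, the --wiki value, and the stdout behavior are assumptions, not the contents of change 363970.)

    import subprocess

    # Both paths are assumptions, for illustration only.
    DUMP_SCRIPT = 'extensions/WikimediaMaintenance/dumpInterwiki.php'
    TARGET = '/srv/mediawiki-staging/wmf-config/interwiki.php'

    def update_interwiki_cache():
        # Step 1: regenerate the cache (assumed here to print the array to stdout).
        out = subprocess.check_output(['mwscript', DUMP_SCRIPT, '--wiki=aawiki'])
        with open(TARGET, 'wb') as fh:
            fh.write(out)
        # Step 2: push the file out, producing a !log entry like the one at 21:52:39.
        subprocess.check_call(['scap', 'sync-file', 'wmf-config/interwiki.php',
                               'Updating interwiki cache'])

    if __name__ == '__main__':
        update_interwiki_cache()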