[00:22:31] (03CR) 10BearND: [C: 031] Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza)
[00:37:42] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad)
[00:46:32] (03PS3) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385)
[01:13:47] (03CR) 10VolkerE: [C: 031] "needs rebase" [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle)
[01:22:24] PROBLEM - HHVM rendering on mw2135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:15] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 73911 bytes in 0.313 second response time
[01:25:14] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[01:33:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[01:34:15] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[01:58:42] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[01:59:23] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[02:09:41] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692629 (10bd808)
[02:09:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[02:22:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[02:22:12] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692639 (10bd808)
[02:25:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[02:33:34] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.3) (duration: 08m 21s)
[02:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:02] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[02:51:22] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[02:51:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[03:07:12] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3692673 (10dbarratt) Thanks @Catrope for getting me access to the file. Here's how to get everyone...
[03:09:58] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 14m 58s)
[03:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:11] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3692690 (10dbarratt) To be fair... just skimming the values, there are a lot that are valid, but th...
[03:17:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 18 03:17:01 UTC 2017 (duration 7m 3s)
[03:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:52] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[03:21:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[03:27:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.10 seconds
[03:33:02] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[03:33:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[03:33:22] 10Operations, 10Traffic: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3692711 (10BBlack)
[03:33:24] 10Operations, 10Traffic, 10Wikimedia-Planet, 10procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692714 (10BBlack)
[03:33:27] 10Operations, 10Phabricator, 10Traffic, 10procurement, 10HTTPS: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692715 (10BBlack)
[03:46:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[03:46:12] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[03:59:32] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:23] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74341 bytes in 0.312 second response time
[04:25:24] PROBLEM - Apache HTTP on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:13] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.118 second response time
[04:28:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 185.06 seconds
[05:20:58] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3692787 (10Marostegui) The ALTER tables finished correctly and no more crashes happened. Let's change the memory anyways @Cmjohnson
[05:22:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921
[05:22:20] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921
[05:24:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:25:19] !log Optimize pagelinks and templatelinks on db1102 for s2 - T174509
[05:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:27] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:26:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:26:36] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:27:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T174509 (duration: 00m 50s)
[05:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509)
[05:33:36] (03PS3) 10Marostegui: install_server: Reinstall db2084 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 (https://phabricator.wikimedia.org/T178359)
[05:34:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:08] (03CR) 10Marostegui: [C: 032] install_server: Reinstall db2084 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[05:36:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:42] !log Optimize pagelinks and templatelinks on db1098 - T174509
[05:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:48] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:37:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 - T174509 (duration: 00m 49s)
[05:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509)
[05:40:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:40:59] !log Optimize pagelinks and templatelinks on db1055 - T174509
[05:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:32] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:42:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:43:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T174509 (duration: 00m 50s)
[05:43:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924
[05:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:58] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:44:00] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924
[05:45:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[05:46:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[05:48:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 49s)
[05:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:13] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[05:48:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[06:03:43] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.160 second response time
[06:04:42] PROBLEM - Blazegraph Port on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:43] PROBLEM - Blazegraph process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:43] PROBLEM - Updater process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:53] PROBLEM - SSH on wdqs1003 is CRITICAL: Server answer
[06:05:02] PROBLEM - configured eth on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:12] PROBLEM - Disk space on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:13] PROBLEM - Check size of conntrack table on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:23] PROBLEM - dhclient process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:23] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:33] PROBLEM - DPKG on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:33] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:43] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:06:03] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:09:14] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I would rather have the process of syncing facts be a complex process where you run a program that cleans up any potentially sensible data" [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron)
[06:25:22] PROBLEM - Check the NTP synchronisation status of timesyncd on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:27:11] can't ssh into that :/
[06:27:42] PROBLEM - IPMI Sensor Status on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:30:13] Anyone able to help? I can't ssh into it, so can't do anything to wdqs1003
[06:30:14] not even depool it
[06:30:52] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.590 second response time
[06:33:31] _joe_: ^
[06:33:47] <_joe_> hoo: I am trying
[06:34:22] <_joe_> hoo: I can depool it, but pybal should depool it automatically
[06:34:33] well, it didn't :/
[06:35:07] <_joe_> ok so that's a problem of how the check is done I think
[06:35:23] <_joe_> ema: ^^ can you take a look at the pybal logs while I kick the machine?
[06:37:19] _joe_: sure
[06:38:29] <_joe_> so I'm in the console, it seems an OOM
[06:38:48] not much of a surprise… gehel is doing GC testing right now
[06:39:16] there's no pybal log on lvs1003 re: wdqs1003
[06:39:28] $ curl http://localhost:9090/pools/wdqs_80
[06:39:28] wdqs1003.eqiad.wmnet: enabled/up/pooled
[06:39:28] wdqs1005.eqiad.wmnet: enabled/up/pooled
[06:39:28] wdqs1004.eqiad.wmnet: enabled/up/pooled
[06:39:53] and indeed HTTP requests to wdqs1003:80 work fine
[06:40:47] that's the reverse proxy in front of the actual service
[06:42:05] <_joe_> ema: what url does get checked there?
[06:42:50] _joe_: proxyfetch.url = ["http://localhost/"]
[06:43:15] <_joe_> ok, meh, that's the reason
[06:43:42] it should check both that and port 9999 (where the actual service is running)
[06:43:45] and I get stuff like [...] Wikidata Query Service
[06:44:16] <_joe_> hoo: no you can't check port 9999
[06:44:26] <_joe_> you should check some url that is meaningful
[06:44:35] <_joe_> and shows if the service is up and functioning
[06:44:58] It could probably run a SELECT "1"-like dummy query cheaply
[06:45:16] FTR there's this dashboard you can check to see what pybal thinks of its services: https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=All
[06:46:17] <_joe_> !log powercycling wdqs1003
[06:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:52] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:47:54] hoo: what's the GET request producing that query?
[06:48:32] RECOVERY - dhclient process on wdqs1003 is OK: PROCS OK: 0 processes with command name dhclient
[06:48:32] RECOVERY - Check size of conntrack table on wdqs1003 is OK: OK: nf_conntrack is 0 % full
[06:48:33] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1003 is OK: OK ferm input default policy is set
[06:48:42] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[06:48:42] RECOVERY - DPKG on wdqs1003 is OK: All packages OK
[06:48:43] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational
[06:48:58] <_joe_> we could even add a /liveness url to the nginx frontend that does the query for us
[06:49:02] RECOVERY - SSH on wdqs1003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[06:49:03] RECOVERY - configured eth on wdqs1003 is OK: OK - interfaces up
[06:49:09] <_joe_> pretty much like probes work in kubernetes
[06:49:12] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80
[06:49:19] I'm still dropping off Oscar at daycare. _joe_ , hoo , thanks for taking care of wdqs1003!
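The failure mode discussed above can be sketched in a few lines: a pybal-style fetch check that only requests `http://localhost/` sees the nginx reverse proxy answer 200 even while Blazegraph behind it is dead. This is a hypothetical illustration, not pybal's real internals; the function names, URLs, and the injected `fetch` callable are all made up for the example.

```python
# Sketch of why a proxy-only health check keeps a dead backend pooled.
# A host counts as healthy only if every configured check URL returns 200.
# (Illustrative only; not pybal's actual ProxyFetch implementation.)

def host_is_healthy(fetch, urls):
    """Healthy only if every check URL returns HTTP 200."""
    return all(fetch(url) == 200 for url in urls)

# Simulated responses while Blazegraph is OOM but nginx still answers:
responses = {
    "http://localhost/": 200,                           # static proxy page: fine
    "http://localhost/sparql?query=ASK%20%7B%7D": 503,  # query reaching the backend: dead
}
fetch = responses.get

# Checking only the proxy front page keeps the host pooled:
assert host_is_healthy(fetch, ["http://localhost/"])
# Adding a URL that actually exercises the backend would depool it:
assert not host_is_healthy(fetch, list(responses))
```

This mirrors the conclusion in the log: the check URL must exercise the backing service, not just the proxy in front of it.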
[06:49:22] RECOVERY - Disk space on wdqs1003 is OK: DISK OK
[06:50:25] <_joe_> gehel: you'll be needed for the followup, don't worry ;)
[06:50:35] BTW, GC tuning was done on wdqs1004, so the cause is something else...
[06:50:43] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:51:15] <_joe_> gehel: ping me when you're back, there is no hurry now
[06:53:24] _joe_: will do
[06:55:22] RECOVERY - Check the NTP synchronisation status of timesyncd on wdqs1003 is OK: OK: synced at Wed 2017-10-18 06:55:14 UTC.
[06:57:42] RECOVERY - IPMI Sensor Status on wdqs1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[07:02:29] !log Stop MySQL on db2079 to copy its data to db2084 - T178359
[07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:07:56] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[07:08:11] hmm this is weird. wdqs1003 should have tons of memory and java heap is way below it
[07:08:56] unfortunately I can't access logs :(
[07:10:26] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/792542d039b7e66780e8b3e9a00fd5e2ab986252b8ac1f0d74719d45be63cb66/merged is not accessible: Permission denied
[07:10:52] but it looks like it was in trouble since 21:00 judging by graphs, and somehow icinga alerted only at 23:00
[07:11:56] I think we should depool wdqs1003 for now, it's 9 hrs behind, not really good to serve queries from it
[07:12:45] _joe_: could you depool it and re-pool when it catches up? you could see the lag at https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
[07:13:20] * gehel is back
[07:13:31] SMalyshev, _joe_: I'll depool wdqs1003
[07:13:42] ok, cool
[07:13:56] and probably worth looking in the logs to see what happened there....
[07:14:04] !log depooling wdqs1003 until it catches up on updates
[07:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:39] <_joe_> gehel: quite a few things to fix here
[07:18:55] _joe_: yep. First LVS check
[07:19:20] <_joe_> gehel: yeah so what we need is to have a url we can call from the load-balancer
[07:19:32] RECOVERY - Disk space on contint1001 is OK: DISK OK
[07:19:33] <_joe_> that tells us if wdqs is up and has an acceptable lag
[07:19:38] _joe_: that one should be easy
[07:19:57] <_joe_> so I'd love to have a "liveness" check and a "readiness" check
[07:20:14] <_joe_> pretty much like kubernetes does
[07:21:29] _joe_: not sure I understand...
[07:22:13] gehel: I see a lot of throttling messages in the log... looks like somebody was really spamming the server... and throttling didn't help
[07:22:31] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/477586f8c192c241d251dcda6d0d050581e5e075b25bf945435a8cb391d8e64d/merged is not accessible: Permission denied
[07:28:46] (03PS3) 10Hoo man: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155)
[07:28:48] (03PS2) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717)
[07:31:37] <_joe_> gehel: something like
[07:32:03] <_joe_> one url we can call that tells us "wdqs can perform basic queries and has an acceptable lag"
[07:32:14] <_joe_> by responding 200
[07:32:31] <_joe_> and some other http code if that's not the case
[07:32:39] <_joe_> that url could be checked by pybal
[07:35:08] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692887 (10mobrovac) The 429 coming from WDQS. @Gehel, @Smalyshev would it be possible to split WDQS' rate limiting for internal and exter...
[07:37:11] is there an ongoing known issue with WDQS?
[07:38:55] <_joe_> mobrovac: read backlog
[07:39:02] <_joe_> mobrovac: why are you asking?
[07:39:28] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692895 (10Smalyshev) @mobrovac what's the rate the requests are currently sent at? IIRC the limits we have are pretty generous, but depen...
[07:39:31] <_joe_> oh our beloved SOA with services failing in sequence
[07:39:32] because the recommendation api service was flapping _joe_
[07:39:45] <_joe_> I love that so much
[07:40:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937
[07:40:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937
[07:40:26] 10Operations, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3692896 (10hashar) For a running Docker container we have: ``` overlay on /v...
[07:40:28] <_joe_> mobrovac: because instead of creating microservices, we create filters, but that's a longer discussion ofc
[07:40:32] RECOVERY - Disk space on contint1001 is OK: DISK OK
[07:40:35] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692899 (10Gehel) @mobrovac it is possible if we can identify internal traffic. The throttling we apply is bucketed by user agent / IP, so...
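The readiness check _joe_ describes above (respond 200 when the service can answer a basic query and its update lag is acceptable, another status code otherwise) could be sketched as follows. This is a hypothetical illustration; the helper name, the lag threshold, and the decision logic are assumptions for the example, not the eventual WDQS implementation.

```python
# Hypothetical readiness decision for a /readiness-style URL: 200 when
# the service answers a trivial query AND its update lag is acceptable,
# 503 otherwise, so a load balancer like pybal can depool on either failure.
# The 1800 s threshold is borrowed from the icinga "High lag" critical
# threshold seen earlier in this log; it is illustrative here.

MAX_LAG_SECONDS = 1800

def readiness_status(query_ok, lag_seconds):
    """Return the HTTP status a readiness URL would respond with."""
    if query_ok and lag_seconds < MAX_LAG_SECONDS:
        return 200
    return 503

assert readiness_status(True, 60) == 200        # healthy, fresh data
assert readiness_status(True, 9 * 3600) == 503  # up but 9 h behind: depool
assert readiness_status(False, 60) == 503       # query failed: depool
```

The point of returning a status code rather than a body is that a generic HTTP health checker needs no service-specific parsing, which matches the "pretty much like probes work in kubernetes" comparison in the log.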
[07:40:57] huh reading the backlog is a bit hard with all these notifications
[07:41:11] the contint1001 disk space alarm is triggered whenever a container runs on the server :/ So most probably it can be ignored for now and I have filed T178454
[07:41:13] T178454: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454
[07:41:34] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:41:54] <_joe_> hashar: which storage driver are you using?
[07:42:27] <_joe_> I would strongly advise against using the loopback devicemapper
[07:42:35] <_joe_> that's what is causing trouble
[07:42:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:43:07] (03PS1) 10Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938
[07:43:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:43:16] _joe_: hello; I have no idea about the storage driver. I guess we just installed the docker-ce package for now and haven't tweaked any settings
[07:44:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 - T174509 (duration: 00m 59s)
[07:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:16] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:44:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509)
[07:46:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[07:46:53] !log Optimize pagelinks and templatelinks on db1056 - T174509
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[07:49:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T174509 (duration: 00m 49s)
[07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:57] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:53:34] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692915 (10mobrovac) >>! In T178445#3692895, @Smalyshev wrote: > @mobrovac what's the rate the requests are currently sent at? IIRC the li...
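On the storage-driver question above: _joe_ advises against the loopback devicemapper driver, and the `/var/lib/docker/overlay2/...` paths in the contint1001 alerts suggest overlay2 was in fact in use there. For reference, the driver can be pinned explicitly in `/etc/docker/daemon.json` rather than relying on the package default; this is a minimal sketch, and whether contint1001 actually carries such a file is an assumption.

```json
{
    "storage-driver": "overlay2"
}
```

The Docker daemon must be restarted after changing this file, and switching storage drivers makes existing images and containers invisible until they are rebuilt or migrated.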
[07:55:27] (03CR) 10Muehlenhoff: "Also reported in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878920" [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff)
[08:00:05] hoo: #bothumor My software never has bugs. It just develops random features. Rise for Usage tracking. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T0800).
[08:00:05] No GERRIT patches in the queue for this window AFAICS.
[08:01:15] (03PS1) 10Muehlenhoff: Remove access for gwicke [puppet] - 10https://gerrit.wikimedia.org/r/384944
[08:01:20] (03PS1) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945
[08:01:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[08:02:38] mobrovac: does recommendation service use a specific user agent? I can't locate those 429s in the logs...
[08:04:42] SMalyshev: it sets the user agent to "Recommendation API (Wikimedia tool; learn more at https://meta.wikimedia.org/wiki/Recommendation_API)"
[08:05:40] (03CR) 10Hoo man: [C: 032] Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:07:15] (03Merged) 10jenkins-bot: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:07:15] mobrovac: I just found it in https://logstash.wikimedia.org/goto/37bc20a91c05ad1c781500fcc806236f
[08:07:25] (03CR) 10jenkins-bot: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:09:27] gehel: 90 reqs per 30 minutes should be about the right rate
[08:09:34] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable description usage tracking on a few test wikis (T177155) (duration: 00m 50s)
[08:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:43] T177155: Find test wiki(s) for new description usage and enable there - https://phabricator.wikimedia.org/T177155
[08:10:43] mobrovac: yes, I can see them now... but I also see a bunch of requests from the same IP without a user agent
[08:11:40] SMalyshev: these are the scb nodes, which have ~10 services, so that might be any other of them (even though I'm not sure there should be any reqs without the UA)
[08:12:56] ah no I was looking at it wrong
[08:13:30] now looking per user agent I see about 2 req/s with this UA
[08:13:43] 2 reqs/s ?
[08:13:55] Not a single new description usage in 4 minutes… that's even more low profile than anticipated
[08:14:17] (03CR) 10Muehlenhoff: [C: 032] Remove access for gwicke [puppet] - 10https://gerrit.wikimedia.org/r/384944 (owner: 10Muehlenhoff)
[08:14:32] in some seconds it's 5 reqs... my kibana-fu is not strong enough to make a proper calculation though
[08:14:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:14:59] that is highly unusual
[08:15:27] (03PS1) 10Marostegui: core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946
[08:16:22] they seem to be all 200s though (though I suspect we somehow fail to log 429s...) anyway, it's late here, I'll check more into it tomorrow
[08:16:37] thnx SMalyshev
[08:17:15] (03PS2) 10Marostegui: core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946
[08:17:16] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692940 (10Gehel) Looking at [[ https://logstash.wikimedia.org/goto/37bc20a91c05ad1c781500fcc806236f | logs in logstash ]], it seems we th...
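Gehel's comment in T178445 describes the WDQS throttling as bucketed by user agent / IP. A minimal sliding-window sketch of that bucketing scheme follows; the class name, limit, and window size are illustrative assumptions, not the real WDQS throttle values or implementation.

```python
# Minimal sketch of rate limiting bucketed by (IP, user agent), in the
# spirit of the throttling discussed in T178445. Two clients sharing an
# IP but sending different user agents get independent budgets.
from collections import defaultdict, deque

class ThrottleBucket:
    def __init__(self, limit, window_seconds):
        self.limit = limit          # max requests per bucket per window
        self.window = window_seconds
        self.hits = defaultdict(deque)  # (ip, ua) -> recent request timestamps

    def allow(self, ip, user_agent, now):
        """Return True (serve) or False (the client would get HTTP 429)."""
        q = self.hits[(ip, user_agent)]
        while q and now - q[0] >= self.window:  # expire hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

t = ThrottleBucket(limit=3, window_seconds=60)
ua = "Recommendation API (Wikimedia tool; ...)"
assert all(t.allow("10.0.0.1", ua, now) for now in (0, 1, 2))
assert not t.allow("10.0.0.1", ua, 3)   # 4th hit inside the window: throttled
assert t.allow("10.0.0.2", ua, 3)       # different IP: separate bucket
assert t.allow("10.0.0.1", ua, 61)      # old hits expired: allowed again
```

This also illustrates the operational problem raised later in the log: internal callers behind a shared node IP that omit a user agent all land in one bucket, so splitting internal from external traffic requires a distinguishing key.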
[08:17:20] (03CR) 10Marostegui: [C: 032] core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946 (owner: 10Marostegui)
[08:17:54] (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema)
[08:22:21] !log mobrovac@tin Started deploy [changeprop/deploy@065a06e]: Bug fix: Apply ignore_errors on a per-request basis
[08:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:39] !log mobrovac@tin Finished deploy [changeprop/deploy@065a06e]: Bug fix: Apply ignore_errors on a per-request basis (duration: 01m 17s)
[08:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:17] !log Stop MySQL on db2019 to copy its data to db2084 - T178359
[08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:25:33] (03PS2) 10Hashar: prometheus: force ferm dns resolution to Ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314)
[08:26:10] (03CR) 10Hashar: "Ditto for:" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar)
[08:29:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[08:31:38] (03PS1) 10Marostegui: s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359)
[08:33:26] (03CR) 10Marostegui: [C: 032] s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:33:51] (03PS3) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717)
[08:34:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would prefer us to add a config to nginx to proxy a request to a simple url to whatever logic we want, abstracted as much as possible fr" [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel)
[08:34:27] (03Merged) 10jenkins-bot: s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:48] (03CR) 10Gehel: "Ok, will do..." [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel)
[08:37:07] (03PS2) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945
[08:38:50] (03CR) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man)
[08:38:53] (03CR) 10Hoo man: [C: 032] Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man)
[08:39:17] mobrovac: looking at https://logstash.wikimedia.org/goto/9209cb15771bce41c14dd744e68e606f I see much more than 1.5 req / minute... More like 30 req / minute...
[08:39:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10hashar) Fixed it by MANUALLY creating a `/va...
[08:40:05] (03Merged) 10jenkins-bot: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [08:40:09] (03CR) 10jenkins-bot: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [08:40:51] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[gwicke] [08:40:53] (03PS3) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 [08:41:48] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Re-enable Statement usage tracking on cawiki (T151717) (duration: 00m 50s) [08:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [08:42:21] (03CR) 10Jcrespo: [C: 031] mysql/icinga/labtest: no pages if on labtest, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [08:44:23] gehel: hm, this is weird, i can't see external requests to it on the public api metrics - https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1 [08:44:54] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/8362/neodymium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [08:45:19] (03PS1) 10Marostegui: mysql-core_codfw.yaml: Add db2084 to s4 and s5 [puppet] - 10https://gerrit.wikimedia.org/r/384950 (https://phabricator.wikimedia.org/T178359) [08:45:51] gehel: also, there seem to be a lot of requests to wdqs in codfw, which i'm not sure it's even possible [08:45:55] what is going on here? 
[08:46:58] (03CR) 10Marostegui: [C: 032] mysql-core_codfw.yaml: Add db2084 to s4 and s5 [puppet] - 10https://gerrit.wikimedia.org/r/384950 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:50:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:50:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:10] (03PS1) 10Hoo man: Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 [08:51:11] gehel: the majority of the requests seem to be generated in codfw, which is really strange [08:51:15] mobrovac: wdqs is active / active, so traffic to codfw is definitely possible [08:52:05] mobrovac: also, all of those requests are for /sparql, with no actual query, so it looks like a sanity check [08:52:37] gehel: yeah, but we still respect the DC boundaries, so if a request originates in eqiad, it will go to wdqs in eqiad, unless its discovery dns is set up differently [08:53:41] in the end, wdqs was under too much load (our throttling does not seem to be aggressive enough) and was misbehaving, that's what needs to be solved first! [08:57:05] 10Operations, 10Operations-Software-Development: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385#3692999 (10MoritzMuehlenhoff) Procurement ticket is T178392 [08:58:38] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3693001 (10jcrespo) > why would there be ~1,500 with 0's exist on English Wikipedia As I commented... [08:59:16] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3693004 (10Cervisiarius) Thanks.
I'm trying to log in as I used to from my old machine: $ ssh west1@stat1005.eqiad.wmnet but I get the (expected)... [09:02:27] (03PS1) 10Muehlenhoff: Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 [09:02:47] (03CR) 10Volans: "nitpick inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:02:54] gehel: as for the queries to /sparql, the service sends them in the post body [09:03:27] (03CR) 10Giuseppe Lavagetto: Port docker builder (0323 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [09:03:28] mobrovac: Oh, of course! [09:03:36] (03CR) 10Muehlenhoff: [C: 032] Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 (owner: 10Muehlenhoff) [09:03:42] (03PS2) 10Muehlenhoff: Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 [09:20:30] (03CR) 10Jcrespo: [C: 031] "+1 to the idea, specially to send a message later to all deployers. The code itself is mostly mediawiki/releng, so cannot comment much on " [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [09:21:55] (03PS2) 10Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938 [09:22:56] (03CR) 10Gehel: "Actually, we might need to split this in 2 CR for deployment..." [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel) [09:23:24] gehel: ok, i think i know where all of these requests are coming from. 
so, in one round of checks, the service sends 3 requests to wdqs (each 60 secs), multiplied by 10 hosts doing this, that's 0.5 reqs/s, however, when the automatic monitoring script doesn't receive the status it was expecting, it retries up to 5 times, which brings us to ~2reqs/s overall [09:30:18] !log drop MobileWikiAppToCInteraction_10375484_15423246 from the log database on dbstore1002,db1047,db1046 - T177960 [09:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:25] T177960: Archive tables to hadoop: MobileWikiAppToCInteraction_10375484_15423246 and Edit_13457736_15423246 - https://phabricator.wikimedia.org/T177960 [09:30:42] mobrovac: Oh... lovely retries! Without backoff I expect? [09:31:50] i think so, yes [09:32:07] or with a minimal back-off rate [09:34:01] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.231 second response time [09:36:25] !log reloading dbproxy1010's haproxy configuration [09:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] !log installing xserver/xvfb security updates [09:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:10] mobrovac: so in a bad situation, those retries probably make it worse... [09:40:28] In the end we are back to the initial question: how do we make wdqs more robust... 
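As an aside, the request-rate arithmetic in the discussion above (3 checks every 60 seconds per host, 10 hosts, up to 5 retries on failure) can be sanity-checked with a few lines. The numbers come from the log; treating the retry limit as a flat multiplier is a simplifying assumption, which is why the worst case lands a bit above the "~2 reqs/s" observed:

```python
# Back-of-the-envelope check of the wdqs request rate discussed above.
# Figures are taken from the log; the flat retry multiplier is an assumption.
CHECK_INTERVAL_S = 60   # each host runs a round of checks every 60 seconds
REQS_PER_ROUND = 3      # 3 requests to wdqs per round
HOSTS = 10              # 10 hosts performing the checks
MAX_RETRIES = 5         # retried up to 5 times when the status is unexpected

# Steady state, every check succeeding on the first attempt:
base_rate = REQS_PER_ROUND * HOSTS / CHECK_INTERVAL_S  # 0.5 reqs/s

# Upper bound if every request exhausts its retries (1 attempt + 5 retries):
worst_case = base_rate * (1 + MAX_RETRIES)  # 3.0 reqs/s; ~2 reqs/s observed

print(base_rate, worst_case)
```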
[09:40:29] (03PS3) 10Muehlenhoff: Provide deb-src entries for older distros on package builders [puppet] - 10https://gerrit.wikimedia.org/r/382160 [09:41:16] (03PS4) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 [09:41:19] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 [09:41:22] (03PS4) 10Elukey: Small refactor for some kafka classes to ease creation of mirror maker profile [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:41:24] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 [09:43:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:43:38] (03PS5) 10Elukey: confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:43:54] db1082? [09:44:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:44:47] did it crash? [09:45:02] (03CR) 10Volans: [C: 031] "LGTM! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:45:33] I think it did [09:45:38] looks so [09:46:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:46:10] storage [09:46:12] looks broken [09:46:15] (03CR) 10Ema: [C: 032] cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:46:19] are you depooling it? 
[09:46:56] yes ^ [09:46:57] (03PS1) 10Jcrespo: mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 [09:47:08] (03CR) 10Marostegui: [C: 031] mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:47:10] +1? [09:47:29] (03CR) 10Jcrespo: [C: 032] mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:47:53] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (10Marostegui) [09:48:34] (03CR) 10jenkins-bot: mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:49:01] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693109 (10Marostegui) Not the first time this happens to this same host: T158188 T145533 [09:49:31] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.129 second response time [09:49:58] marostegui: repool db1098 [09:50:03] ok [09:50:06] ok to deploy or not? [09:50:10] 10Operations, 10Datasets-General-or-Unknown, 10Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3693113 (10ArielGlenn) In order to get dewiki to complete on time (before the 20th), I'm running the 4th part of the rev history con... [09:50:15] jynus: ok to depool [09:50:22] no, the merge is to repool [09:50:29] repool yes [09:50:31] ok to repool [09:51:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1098, depool db1082 (crashed) (duration: 00m 50s) [09:51:47] did db1082 come back?
[09:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:27] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693117 (10Marostegui) The RAID looks good: ``` root@db1082:~# hpssacli controller all show config Smart Array P840 in Slot 1 (sn: PDNNF0ARH1910I) Port Name: 1I Port Name: 2I Internal Drive Cage at Por... [09:53:47] 171018 9:42:42 [ERROR] InnoDB: Tried to read 16384 bytes at offset 595802554368. Was only able to read 0. [09:53:48] 2017-10-18 09:42:42 7f3908df7700 InnoDB: Operating system error number 5 in a file operation. [09:53:50] InnoDB: Error number 5 means 'Input/output error' [09:54:00] clearly block device error [09:54:04] (03CR) 10Elukey: [C: 032] confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:54:11] (03PS6) 10Elukey: confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:54:21] 171018 9:42:42 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation [09:54:30] 171018 09:43:11 mysqld_safe Number of processes running now: 0 [09:54:54] 10Operations, 10ops-eqdfw, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (10Marostegui) p:05Triage>03High This is the second time this server has a storage crash: T158188 @Cmjohnson can we get a new RAID controller for this host? It has happened twice already. [09:55:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [09:55:16] 10Operations, 10ops-eqiad, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693131 (10Marostegui) [09:55:20] db1044 is lagging behind, is it pooled? 
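The "clearly block device error" diagnosis above follows directly from the InnoDB log lines quoted: "Operating system error number 5" is the Linux `errno` for `EIO`, which the kernel returns when the underlying storage fails a read. A quick interpreter check (Linux `errno` values assumed) confirms the mapping:

```python
import errno
import os

# InnoDB reported "Operating system error number 5 in a file operation";
# on Linux, errno 5 is EIO, the generic low-level I/O failure, which is
# why the log then says "Error number 5 means 'Input/output error'".
assert errno.errorcode[5] == "EIO"
print(os.strerror(errno.EIO))
```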
[09:55:39] db1044 has no lag [09:55:44] Seconds_Behind_Master: 0 [09:55:51] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:55:54] "Server db1044 has 5.4986140727997 seconds of lag (>= 4.5989730358124)" [09:55:56] on logs [09:56:13] ah, maybe it is spiking some seconds (it has 0 weight) because of mydumper [09:58:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [09:59:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.096 second response time [10:00:11] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.525 second response time [10:01:39] 10Operations, 10ops-eqiad, 10DBA: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3693168 (10Marostegui) [10:04:16] I will start the slave on db1082, not worth keeping it stopped if the server is up [10:04:49] yeah [10:04:50] agreed [10:05:13] (03PS1) 10Addshore: ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) [10:06:43] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3693178 (10mobrovac) The amount of requests from the Recommendation API service actually makes sense. On each service checker script run,... [10:20:54] (03CR) 10Elukey: "I am trying to figure out where this use case fits into the https://wikitech.wikimedia.org/wiki/Puppet_coding guideline. 
It is technically" [puppet] - 10https://gerrit.wikimedia.org/r/384608 (owner: 10Ottomata) [10:21:05] !log imported linux 4.9.30-2+deb9u5~bpo8+1 for jessie-wikimedia [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:57] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3693217 (10jcrespo) To try to solve the previous issue, the following grants have been executed: ``` root@dbstore2001> set sql_log_bin=0; GRANT SELECT,... [10:26:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3693218 (10elukey) p:05Triage>03Normal [10:32:12] (03PS7) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [10:34:46] (03PS1) 10Elukey: Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) [10:50:52] !log rebooting multatuli for kernel update [10:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:23] !log mobrovac@tin Started deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now) [10:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:23] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused [11:00:42] known ^ [11:00:46] (03CR) 10Jcrespo: [C: 031] Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:02:21] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15742 bytes in 0.025 second response time [11:02:46] !log mobrovac@tin Finished 
deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now) (duration: 05m 23s) [11:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:01] !log mobrovac@tin Started deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now), part #2 [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:31] (03CR) 10Elukey: [C: 032] Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:11:54] !log mobrovac@tin Finished deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now), part #2 (duration: 08m 53s) [11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:34] (03PS1) 10Jcrespo: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) [11:18:07] (03CR) 10Marostegui: [C: 031] maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:18:39] (03CR) 10Jcrespo: [C: 032] maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:19:53] (03Merged) 10jenkins-bot: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:20:05] (03CR) 10jenkins-bot: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:21:04] !log upgrading mw1261 to wikidiff2 1.5.1 [11:21:10] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:46] !log Optimize pagelinks and templatelinks on db1102 s7 - T174509 [11:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:54] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:31:49] !log upgrading mw1262-mw1265 (canaries) to wikidiff2 1.5.1 [11:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:01] (03PS1) 10Marostegui: s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) [11:39:25] (03CR) 10Marostegui: [C: 032] s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:40:11] (03Merged) 10jenkins-bot: s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:45:34] !log mobrovac@tin Started deploy [restbase/deploy@99052c1]: Use Cass3 storage for ruwiki summaries [11:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1105 to help db1071 because db1082 has crashed (duration: 00m 50s) [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:33] !log mobrovac@tin Finished deploy [restbase/deploy@99052c1]: Use Cass3 storage for ruwiki summaries (duration: 08m 59s) [11:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:14] high errors on mediawikiwiki, strange [11:55:53] do we have an "errors per mw version" view¿ [11:57:43] mw1265 errors, seem to be due to the upgrade [11:59:36] PROBLEM - Nginx local proxy to apache on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response 
time [11:59:53] I think something is wrong with TTMServerMessageUpdateJob [11:59:56] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.624 second response time [12:00:09] since 11:04 [12:00:37] RECOVERY - Nginx local proxy to apache on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.143 second response time [12:00:54] mobrovac: something it could be deployed? [12:00:56] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 74332 bytes in 0.416 second response time [12:01:42] jynus: the hhvm failures? no, not likely [12:01:44] no, it does not start then [12:03:19] It is an ongoing issue with ttmserver jobs, I think [12:04:15] !log mobrovac@tin Started deploy [restbase/deploy@3c7abf6]: Bug fix: Remove the space in the summary table name [12:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:40] hhvm dumped core on mw1206, having a look [12:09:50] that's unrelated to the upgrade on mw1261-mw1265 [12:11:12] there's a number of fatal errors caused by a stack overflow in remex-html [12:11:28] !log mobrovac@tin Finished deploy [restbase/deploy@3c7abf6]: Bug fix: Remove the space in the summary table name (duration: 07m 13s) [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10faidon) nutcracker ships `/usr/lib/tmpfiles.... 
[12:20:10] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10MoritzMuehlenhoff) deployment-videoscaler01... [12:28:33] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3693635 (10Gilles) [12:28:45] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10Gilles) Videoscalers don't run Thumbor [12:29:25] (03PS1) 10BBlack: browsersec: exclude static/images so the errorpage logo can fetch [puppet] - 10https://gerrit.wikimedia.org/r/384976 [12:30:10] (03CR) 10BBlack: [C: 032] browsersec: exclude static/images so the errorpage logo can fetch [puppet] - 10https://gerrit.wikimedia.org/r/384976 (owner: 10BBlack) [12:30:29] (03PS3) 10BBlack: errorpage: Migrate from back-compat wmf.png to wmf-logo.png [puppet] - 10https://gerrit.wikimedia.org/r/381274 (owner: 10Krinkle) [12:30:38] (03CR) 10BBlack: [V: 032 C: 032] errorpage: Migrate from back-compat wmf.png to wmf-logo.png [puppet] - 10https://gerrit.wikimedia.org/r/381274 (owner: 10Krinkle) [12:30:56] (03PS3) 10BBlack: errorpage: Set explicit height on logo image [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle) [12:31:02] (03CR) 10BBlack: [V: 032 C: 032] errorpage: Set explicit height on logo image [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle) [12:34:18] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [12:35:34] (03PS3) 10BBlack: cacheproxy: IPv4 PMTUD 
when blackhole detected [puppet] - 10https://gerrit.wikimedia.org/r/384526 [12:50:02] (03CR) 10Muehlenhoff: Synchronise jenkins package to thirdparty/ci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [12:51:37] (03PS2) 10Muehlenhoff: Synchronise jenkins package to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) [12:55:03] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3657517 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['db1107.eqiad.wmnet', 'db11... [12:59:07] (03PS2) 10Zfilipin: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [12:59:09] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3693697 (10Gehel) I'm not sure why we have a retry in the first place. An exponential back-off would be good, or to honor the "Retry-After... [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1300). [13:00:04] dcausse and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
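The exponential back-off that Gehel suggests on T178445 above is a standard pattern: double the wait after each failed attempt, cap it, and add jitter so that many checker hosts do not retry in lockstep. A minimal sketch (function names and parameter values are illustrative, not from any WMF codebase):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential back-off with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds. In an HTTP client this
    lower bound would also honor a Retry-After header if one is sent."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5):
    """Call fn(), retrying on any exception with growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(backoff_delay(attempt))
```

With this in place, a flood of failing checks spreads out instead of compounding the load, which addresses the "retries probably make it worse" concern raised earlier in the log.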
[13:00:11] I can SWAT today [13:00:14] o/ [13:00:15] o/ [13:01:14] dcausse, phuedx: do you expect your commit to take a long time to deploy/test? [13:01:25] zeljkof: nope [13:01:28] zeljkof: no [13:02:16] ok, in that case starting in the calendar order, dcausse first, phuedx second [13:02:30] dcausse: will ping you when the commit is at mwdebug [13:02:33] ok [13:03:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:05:04] (03Merged) 10jenkins-bot: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:05:49] !log uploaded wikidiff2 1.5.1 to apt.wikimedia.org [13:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:13] (03CR) 10jenkins-bot: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:07:09] phuedx: your commit is merged, will ping you in a minute when it's at mwdebug1002 [13:07:15] thanks zeljkof [13:08:14] phuedx: it's at mwdebug1002 [13:08:19] ta [13:09:47] dcausse: the commit is merged, will be at mwdebug in a minute or two [13:09:54] zeljkof: sure [13:11:10] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Make WDQS throttling more aggressive - https://phabricator.wikimedia.org/T178491#3693708 (10Gehel) [13:11:18] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3693722 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1108.eqiad.wmnet', 'db1107.eqiad.wmnet'] ``` and were **ALL** successful. 
[13:11:28] \o/ [13:12:44] zeljkof: lgtm [13:13:20] phuedx: ok, deploying [13:13:29] dcausse: it's at mwdebug1002 [13:13:35] zeljkof: ok testing [13:14:11] zeljkof: cool, i'll be watching this graph: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=Popups&from=now-30m&to=now [13:14:24] zeljkof: task number for the deploy subject: T176469 [13:14:24] T176469: Relaunch page previews a/b test on en and de wiki - https://phabricator.wikimedia.org/T176469 [13:15:14] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:383378|pagePreviews: Restart A/B test on enwiki and dewiki (T176469)]] (duration: 00m 51s) [13:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:23] phuedx: deployed, please check [13:18:30] zeljkof: confirmed that the change is visible in the browser (the config values on enwiki and dewiki are as expected) [13:19:00] phuedx: great! thanks for deploying with #releng ;) [13:19:01] and seeing events appearing in the pipeline (per the graph) [13:20:25] dcausse: just checking, do you need more time to test? [13:20:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3693745 (10faidon) Ah! Yes, that all makes sense now, thanks! We ha... 
[13:20:35] zeljkof: yes [13:20:45] ok, no rush, just checking [13:21:00] !log repooling wdqs1003 now that it has caught up on updates [13:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] zeljkof: looks good [13:21:56] dcausse: ok, deploying [13:22:58] !log zfilipin@tin Synchronized php-1.31.0-wmf.3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:384669|[cirrus] Turn on recall A/B test on enwiki (T177502)]] (duration: 00m 50s) [13:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] T177502: Deploy A/B test to test relaxing the retrieval query filter - https://phabricator.wikimedia.org/T177502 [13:23:11] dcausse: deployed, please check [13:23:35] zeljkof: perfect, thanks! [13:23:47] dcausse: thanks for releasing with #releng ;) [13:28:07] !log EU SWAT finished! [13:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:53] (03PS1) 10Elukey: netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) [13:38:27] (03CR) 10Herron: "So there is an rsyncd already running on the puppetmasters serving /var/lib/puppet/server/ssl/ca and /var/lib/puppet/volatile. This would " [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [13:41:03] 10Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3693836 (10elukey) Updating this task in light of the recent discussions. The analytics and DBA teams have been fighting a lot with disk space consumption on dbstore1002 due t...
[13:46:31] (03PS2) 10Elukey: netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) [13:52:39] (03PS1) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [13:53:11] (03CR) 10jerkins-bot: [V: 04-1] Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [13:53:20] (03PS2) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [13:53:50] (03CR) 10jerkins-bot: [V: 04-1] Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [13:58:13] (03PS3) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [14:03:14] !log create reading-lists cloud project [14:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:36] !log bump up quota on cyberpower cloud project per T178332 [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:45] T178332: Request increased quota for cyberbot Cloud VPS project - https://phabricator.wikimedia.org/T178332 [14:03:55] !log add static ip to cloud mwstake project per T178012 [14:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:05] T178012: Request increased quota for mwstake Cloud VPS project - https://phabricator.wikimedia.org/T178012 [14:04:12] someone is executing SELECT SLEEP(@time) on terbium ... 
[14:05:33] (03Abandoned) 10Herron: puppetmaster: add yaml fact directory to rsyncd on frontends [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [14:12:24] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3693993 (10RobH) The details on a working ssh config are listed here: https://wikitech.wikimedia.org/wiki/Production_shell_access#Standard_config... [14:16:04] (03PS1) 10BBlack: browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 [14:17:48] (03PS4) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [14:23:43] !log uploaded hhvm 3.18.5+dfsg-1+wmf1+deb9u1 to apt.wikimedia.org/stretch-wikimedia [14:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:56] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3694021 (10dbarratt) >>! In T178313#3693001, @jcrespo wrote: > As I commented above, I think by def... [14:32:11] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694025 (10Cmjohnson) The disk has been replaced [14:32:21] (03CR) 10Ema: [C: 031] browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 (owner: 10BBlack) [14:39:36] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694030 (10Cmjohnson) A new DIMM has been requested with Dell. You have successfully submitted request SR955416674. 
[14:43:31] (03CR) 10BBlack: [C: 032] browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 (owner: 10BBlack) [14:48:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694047 (10Cmjohnson) A case with HPE has been submitted Your case was successfully submitted. Please note your Case ID: 5323881381 for future reference. [14:50:29] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (10jcrespo) ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, Rebuilding) ``` [14:56:38] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3694097 (10Cmjohnson) @herron The server is out of warranty but I took a PSU from a decom server and replaced psu1 on analytics1037 and I no longer see the error. Take a look an... [14:56:54] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3694099 (10thcipriani) [15:03:02] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694113 (10Dzahn) exim-ganglia stats can/should be removed from everything, which will also close this. also, i haven't received any of this in a long ti... 
[15:03:11] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694114 (10Dzahn) p:05High>03Normal [15:03:40] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292437 (10Dzahn) a:03Dzahn [15:08:11] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:08:13] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694164 (10Paladox) @Dzahn yep, I’ve applied them manually on the puppet master and there somewhere in the git log. [15:08:21] PROBLEM - Host labvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:11] PROBLEM - Nginx local proxy to apache on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time [15:10:08] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694172 (10RobH) a:03cwdent @cwdent: This makes sense to me! I'm on clinic duty this week, so I'll be assisting you in getting your access reinstated. https://wikitech.wikimed... 
[15:10:11] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:10:12] RECOVERY - Nginx local proxy to apache on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [15:11:57] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#3694181 (10Dzahn) [15:13:32] RECOVERY - Host labvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [15:14:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694183 (10Cmjohnson) @bd808 I swapped the CPU's to see if the error follows the CPU. The replacement that I put in there was refurbished so there is a possibility it was ba... [15:14:42] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694184 (10Marostegui) Great!! Thanks! [15:15:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694186 (10Marostegui) Thank you! [15:15:26] 10Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#3694187 (10Dzahn) @volans This seems to be what you mentioned in last monitoring meeting when you suggested an Icinga alert for cronspam. [15:17:09] (03PS1) 10RobH: cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) [15:18:25] (03PS2) 10RobH: cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) [15:18:49] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694199 (10cwdent) @RobH yes that is the correct group, thanks! 
I only used the login a handful of times before but I am pretty sure it was to stat100[45] [15:19:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694201 (10RobH) 05Open>03stalled a:05cwdent>03RobH I've prepared the patchset above. If no objections are noted, the 3 day wait for group additions... [15:19:56] 10Operations, 10monitoring: Cron spam: figure out a way it doesn't get ignored - https://phabricator.wikimedia.org/T178311#3694206 (10Volans) Closing in favour of T84845 [15:20:08] 10Operations, 10monitoring: Cron spam: figure out a way it doesn't get ignored - https://phabricator.wikimedia.org/T178311#3694209 (10Volans) [15:20:10] 10Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) [15:20:18] 10Operations, 10monitoring: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) [15:20:54] 10Operations, 10monitoring: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) @Dzahn thanks for pointing this out, I've merged in as duplicate the other task I had opened. 
[15:28:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3694233 (10herron) [15:28:50] (03PS1) 10Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 [15:32:49] (03PS1) 10Herron: puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) [15:33:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3694240 (10herron) [15:34:43] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3694242 (10herron) [15:35:01] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:02] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:11] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:26] somebody is probably hammering stat1006 [15:35:32] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:32] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:42] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:51] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:53] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:53] PROBLEM - puppet last run on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:38:21] PROBLEM - Check the NTP synchronisation status of timesyncd on 
stat1006 is CRITICAL: Return code of 255 is out of bounds [15:39:06] (03CR) 10Volans: "Much nicer, thanks for the fixes" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [15:41:14] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3694264 (10jcrespo) Latest run does flow and others too, correctly: ``` root@dbstore2001:/srv/backups/x1.20171017220041$ ls -la | grep flowdb -rw-r--r--... [15:41:22] 10Operations, 10ops-ulsfo, 10netops: connect new office link to asw-ulsfo - https://phabricator.wikimedia.org/T176350#3694265 (10RobH) 05stalled>03Resolved I neglected to resolve this, but it was handled within a day or so of our onsite work being completed. [15:41:51] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient [15:41:51] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set [15:41:51] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational [15:42:01] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up [15:42:10] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [15:42:11] RECOVERY - Disk space on stat1006 is OK: DISK OK [15:42:37] oomkiller did all the work [15:43:33] (03PS5) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:43:46] elukey: which process got killed? 
[15:45:51] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:48:00] RECOVERY - HP RAID on db1092 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK [15:48:31] (03PS6) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:48:50] moritzm: a python process that was probably a data crunching one [15:50:07] k [15:50:43] (03PS7) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:52:11] RECOVERY - DPKG on stat1006 is OK: All packages OK [15:53:29] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694315 (10jcrespo) 05Open>03Resolved a:03jcrespo ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK) RECOVERY - HP RAID on db1092 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:... 
[15:54:00] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:55:55] (03PS8) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [15:55:57] (03PS2) 10Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 [15:56:11] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:11] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:21] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:22] grrr [15:56:25] checking again [15:56:30] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:00] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:00] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:01] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:01] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:51] PROBLEM - puppet last run on stat1006 is CRITICAL: Return code of 255 is out of bounds [16:01:06] (talking with the owner of the script in analytics) [16:02:42] (03PS2) 10Herron: puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) [16:04:19] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [16:05:14] (03CR) 10Herron: [C: 032] puppet: temporarily allow puppetcompiler1001 to fetch all catalogs 
[puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [16:05:37] (03PS1) 10RobH: install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) [16:06:30] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/8368/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [16:07:12] (03CR) 10RobH: [C: 032] install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) (owner: 10RobH) [16:07:20] (03PS2) 10RobH: install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) [16:10:33] (03PS1) 10Alexandros Kosiaris: Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 [16:10:35] (03PS1) 10Alexandros Kosiaris: package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 [16:10:37] (03PS1) 10Alexandros Kosiaris: package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 [16:10:39] (03PS1) 10Alexandros Kosiaris: package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 [16:12:03] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient [16:12:03] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set [16:12:03] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational [16:12:13] RECOVERY - DPKG on stat1006 is OK: All packages OK [16:12:13] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up [16:12:33] RECOVERY - Disk space on stat1006 is OK: DISK OK [16:12:53] 
RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:13:03] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:13:23] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [16:19:05] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694375 (10RobH) [16:21:22] (03CR) 10Paladox: [C: 031] "Passes" [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [16:24:35] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Make WDQS throttling more aggressive - https://phabricator.wikimedia.org/T178491#3694379 (10debt) p:05Triage>03High [16:25:02] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown) [16:25:03] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 38 failures. Last run 2 minutes ago with 38 failures. Failed resources (up to 3 shown): File[/home/robh],File[/home/thcipriani],File[/home/jgreen],File[/home/dduvall] [16:25:16] .... [16:25:28] herron: ^ could your change cause that? [16:25:32] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 41 failures. Last run 2 minutes ago with 41 failures. Failed resources (up to 3 shown): File[/etc/fonts/conf.d/10-antialias.conf],File[/usr/local/bin/hhvmadm],File[/usr/local/sbin/hhvm-dump-debug],File[/usr/local/sbin/hhvm-collect-heaps] [16:25:49] or mine did... not sure [16:25:52] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:25:53] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 36 failures. Last run 2 minutes ago with 36 failures. Failed resources (up to 3 shown) [16:25:59] im going to back mine out cuz its easy. 
[16:26:02] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:03] robh possibly [16:26:12] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 38 failures. Last run 3 minutes ago with 38 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/home/hashar],File[/home/yuvipanda] [16:26:14] herron: did you wanna revert yours? [16:26:21] ill leave mine alone since mine is less likely [16:26:22] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:27] I bounced apache on the puppetmasters to pick it up. curious if these will clear on next run [16:26:28] unless i have a bad regex in my site.pp entry. [16:26:33] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_hadoop_yarn_node_state],File[/usr/lib/nagios/plugins/check_timedatectl] [16:26:52] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:53] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:26:57] but i dont see how my change would do this [16:27:06] (03CR) 10jerkins-bot: [V: 04-1] cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:16:24] (03PS4) 10Dzahn: cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 [17:16:47] (03CR) 10jerkins-bot: [V: 04-1] cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:19:02] (03PS5) 10Dzahn: cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 [17:19:35] (03PS1) 10Ottomata: Move require_package('prometheus-jmx-exporter') to profile::prometheus::jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:20:08] (03CR) 10jerkins-bot: [V: 04-1] Move require_package('prometheus-jmx-exporter') to profile::prometheus::jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:20:16] (03CR) 10Dzahn: [C: 032] "for labs VPS project by Cyberpower" [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:21:05] (03PS2) 10Ottomata: Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:22:01] (03CR) 10Ottomata: [C: 032] Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:22:08] (03PS3) 10Ottomata: Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:22:10] (03CR) 10Ottomata: [V: 032 C: 032] Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:28:45] (03CR) 10Phuedx: [C: 031] Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 
(https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [17:29:19] (03PS1) 10Ottomata: Own jmx exporter config files as root [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:32:29] (03PS2) 10Ottomata: Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:32:54] (03CR) 10jerkins-bot: [V: 04-1] Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 (owner: 10Ottomata) [17:33:14] (03PS3) 10Ottomata: Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:33:53] (03CR) 10Ottomata: [C: 032] Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 (owner: 10Ottomata) [17:44:08] ping legoktm [17:44:19] pong-ish [17:46:31] davidwbarratt: ? [17:47:16] legoktm https://phabricator.wikimedia.org/T177667#3680012 able to figure out why xdebug is missing? [17:48:27] no, I haven't had a chance to look into that [17:48:40] I think hashar recently changed how xdebug is installed in CI? [17:49:20] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694640 (10RobH) [17:51:20] idk my bff jill [17:54:57] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694663 (10RobH) cp4030 was ready to install, but now it won't let me connect to com2. I was connected, but my session timed out while I was on via com2, and now it won't let me ba... [17:56:24] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694668 (10RobH) [17:59:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694673 (10bd808) >>! In T171473#3694183, @Cmjohnson wrote: > Let's monitor and see if the error persists. 
We probably need to find a way to put load on this system rather t... [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:22] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694688 (10RobH) [18:17:19] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:28] (03PS3) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:20:43] (03PS4) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:21:10] (03CR) 10Ottomata: "This works in labs, except I'm unsure of how to test the prometheus jmx exporter part." 
[puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:21:14] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:22:54] (03PS5) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:23:12] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:26:21] (03PS8) 10BBlack: link against jemalloc and tune it a bit [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384435 [18:26:23] (03PS10) 10BBlack: Release 0.1.0 [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382873 [18:26:25] (03PS1) 10BBlack: Allow queue memory shrinking [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385015 [18:26:45] (03PS6) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:27:16] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:27:23] (03PS7) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:41:11] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694773 (10RobH) [18:42:30] !log restarting rabbitmq-server on labcontrol1001; too many timeouts [18:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:12] 
(03PS3) 10Zoranzoki21: Sort dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383999 (owner: 10Hoo man) [18:47:26] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:57:49] (03CR) 10BBlack: [V: 032 C: 032] Various minor improvements/updates [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384165 (owner: 10BBlack) [18:57:54] (03CR) 10BBlack: [V: 032 C: 032] Remove multi-head support from strq, move into purger. [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382865 (owner: 10BBlack) [18:57:55] !log restarting nova-network on labnet1001 [18:58:00] (03CR) 10BBlack: [V: 032 C: 032] Move all URL parsing and HTTP req generation to receiver [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382867 (owner: 10BBlack) [18:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:05] (03CR) 10BBlack: [V: 032 C: 032] Chain the purgers together and split their stats [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382868 (owner: 10BBlack) [18:58:15] (03CR) 10BBlack: [V: 032 C: 032] Bump http-parser upstream src to 2.7.1 + fixups [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382870 (owner: 10BBlack) [18:58:16] (03CR) 10BBlack: [V: 032 C: 032] Refactor (rewrite?!) 
purging code [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384167 (owner: 10BBlack) [18:58:19] (03CR) 10BBlack: [V: 032 C: 032] strq+purger: refactor, simplify, add queue delays [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384433 (owner: 10BBlack) [18:58:24] (03CR) 10BBlack: [V: 032 C: 032] Rework stats further [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384434 (owner: 10BBlack) [18:58:29] (03CR) 10BBlack: [V: 032 C: 032] Allow queue memory shrinking [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385015 (owner: 10BBlack) [18:58:35] (03CR) 10BBlack: [V: 032 C: 032] link against jemalloc and tune it a bit [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384435 (owner: 10BBlack) [18:58:41] (03CR) 10BBlack: [V: 032 C: 032] Release 0.1.0 [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382873 (owner: 10BBlack) [19:00:05] no_justification: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:39] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:00:51] no_justification i found a mistake in ^^ heh [19:01:04] i was wondering why gerrit.war was not showing as a symlink in bin/ [19:01:27] !log ran LoginNotify/maintenance/migratePreferences.php on test and test2 [19:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:16] paladox: Tbh, we could just run it from the one outside of review_site and ignore bin/ entirely :) [19:02:25] oh can we? 
[19:02:29] ah i see [19:02:36] It doesn't matter, since we provide full paths to -d [19:02:36] that will need a systemd change [19:02:42] Either way, it's one change [19:02:45] yeh [19:03:12] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:03:15] This is probably safer and doesn't need a restart [19:03:18] So amending [19:03:22] yep [19:03:28] (03PS2) 10Chad: Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 [19:04:25] (03CR) 10Paladox: [C: 031] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:04:59] (03PS1) 10BBlack: Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385018 [19:05:01] (03PS1) 10BBlack: vhtcpd (0.1.0-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385019 [19:08:52] (03CR) 10Smalyshev: wdqs: LVS check should reach blazegraph and do a simple query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel) [19:12:47] i'm guessing tomorrow we can dismantle the gerrit deb repo :). If things are still ok. [19:13:58] (03CR) 10Ladsgroup: [C: 031] "That's a great idea." [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [19:15:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694836 (10chasemp) @Andrew scheduled 20 instances to this server and 4 think they came up and the rest failed. ```2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager...
[19:15:43] !log stopping nodepool temporarily to give rabbit a chance to catch up [19:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:30] !log restarting nodepool [19:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:45] (03CR) 10BBlack: [V: 032 C: 032] Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385018 (owner: 10BBlack) [19:18:52] (03CR) 10BBlack: [V: 032 C: 032] vhtcpd (0.1.0-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385019 (owner: 10BBlack) [19:19:55] !log uploaded vhtcpd-0.1.0-1 to jessie-wikimedia, testing on cp1008 only for now [19:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:21] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:21] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:32] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:51] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:02] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:11] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:32] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:32] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:25:41] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:41] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:51] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:52] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:52] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:02] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:12] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:23] ^did someone merge something invasive recently? [19:26:32] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:52] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:27:01] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:02] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:11] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:11] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:12] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:12] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:20] chasemp: picked random one, elastic1036, no issue there. so looks like master [19:27:20] chasemp i think it's puppetdb. [19:27:22] ...seems to be transient where I checked [19:27:35] mutante: same experience here [19:27:41] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:02] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:02] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:11] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:12] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:22] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:29:01] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:01] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:11] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:13] (03CR) 10Zoranzoki21: ":D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383999 (owner: 10Hoo man) [19:29:21] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:21] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:21] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:31] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:41] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:01] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:01] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:02] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:02] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:30:11] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:32] PROBLEM - puppet last run on dbproxy1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:41] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:41] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:51] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:31:25] puppetdb is running.. but i think paladox is right that it was a short outage there [19:31:28] and it's recovering now [19:31:31] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:31:39] I saw a lot of transient puppetmaster1002 puppet-master[8600]: The environment must be purely alphanumeric, not '' [19:31:49] but I'm unsure if that's normal noise or not as there is a lot of noise [19:32:04] (03PS1) 10RobH: lvs400[567] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/385022 (https://phabricator.wikimedia.org/T178436) [19:32:20] chasemp: that looks like normal noise to me [19:32:21] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:32:27] seen that a lot without this issue [19:35:22] (03CR) 10RobH: [C: 032] lvs400[567] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/385022 (https://phabricator.wikimedia.org/T178436) (owner: 10RobH) [19:35:41] k, I see almost all successes at the moment but am not sure what was the deal [19:35:49] gotta hop in a meeting in a minute tho [19:36:12] hmm I was just running some catalog diffs. wonder if that is related? [19:36:24] if it bogs down the db likely... [19:36:26] could have increased the load on the masters [19:36:33] puppetdb that is [19:36:42] but it was one at a time with a sleep in between each [19:41:05] (03PS1) 10RobH: new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) [19:43:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694964 (10Andrew) I think the VM creation failure was a (mostly? completely?) unrelated issue. I've rescheduled some actually running VMs there, and will see how they do. 
[19:51:42] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:51:52] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:52:02] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:52:12] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:52:12] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:53:02] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:53:02] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:53:11] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:53:21] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:01] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:54:11] RECOVERY - puppet last run on labnodepool1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:21] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:54:21] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:21] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:21] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:54:21] RECOVERY - puppet last run 
on mw1212 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:31] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:32] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:41] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:51] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:52] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:55:01] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:02] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:02] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:55:02] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:11] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:11] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:42] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, 
last run 3 minutes ago with 0 failures [19:55:42] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:42] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:42] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:51] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:55:52] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:56:02] RECOVERY - puppet last run on mwlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:56:08] (03PS1) 10Ottomata: Temporarily allow rsync access to FRACK hosts. [puppet] - 10https://gerrit.wikimedia.org/r/385027 (https://phabricator.wikimedia.org/T178509) [19:56:12] RECOVERY - puppet last run on mc1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:56:37] Jeff_Green: ^^ [19:56:42] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:56:42] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:56:42] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:01] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:11] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:57:11] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:57:13] ottomata: great. 
I'm doing terrible queries now :-) [19:57:21] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:41] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:41] don't forget to limit to the partitions you need :D [19:58:11] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:58:43] limit to partitions? I've got the date limits and the hostname and path stuff, I'm not sure what you mean beyond that? [19:59:01] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:59:53] (03CR) 10Ottomata: [C: 032] Temporarily allow rsync access to FRACK hosts. [puppet] - 10https://gerrit.wikimedia.org/r/385027 (https://phabricator.wikimedia.org/T178509) (owner: 10Ottomata) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:08] date limits [20:00:11] are mostly [20:00:13] you can also limit to text data [20:00:22] jouncebot: Nothing for ORES today [20:00:36] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Always_restrict_queries_to_a_date_range_.28partitioning.29 [20:00:45] jgreen so ya, the dates you need [20:00:45] and [20:00:49] webrequest_source='text' [20:00:55] that way you don't also have to read upload, etc. [20:01:41] oh ho [20:02:32] i'll do that next time, it's too late for the current query which is currently writing files [20:02:51] is there a way to consolidate output into a single file instead of the many in dir thing?
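(Editorial aside: the partition-pruning advice in the exchange above — constrain webrequest queries by the date partitions and `webrequest_source='text'` so Hive reads only the partitions it needs — amounts to a WHERE clause like the one this small helper builds. The function and its name are purely illustrative, not an actual tool from the discussion; the `year`/`month`/`day`/`webrequest_source` partition columns are assumed from the linked wikitech page.)

```python
from datetime import date, timedelta

def partition_predicate(source: str, start: date, end: date) -> str:
    """Build a Hive WHERE fragment restricting a webrequest query to one
    webrequest_source and a range of daily partitions, so Hive reads only
    the partitions it needs instead of scanning the whole table."""
    days = []
    d = start
    while d <= end:
        days.append(f"(year={d.year} AND month={d.month} AND day={d.day})")
        d += timedelta(days=1)
    return f"webrequest_source='{source}' AND ({' OR '.join(days)})"
```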
[20:04:59] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3694999 (10hashar) @gilles sorry for the spam... [20:05:05] (03CR) 10Hashar: Create /run/nutcracker on stretch onwards (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [20:05:35] nothing for mobileapps today [20:08:24] Jeff_Green: from hive not really [20:08:33] ok [20:08:34] since the files are being written by individual processes on various nodes across the cluster [20:08:42] but you can do it when you copy to your home dir out of hdfs [20:08:55] yup yup, easy enough [20:08:58] hdfs dfs -text /path/to/files/in/hdfs/* > localfile [20:09:13] or -copyToLocal and then cat * > localfile [20:09:34] ok [20:14:29] ottomata: I've been using "overwrite local directory" and so on, is that redundant with what's created in /mnt/hdfs/.../jgreen/ ? [20:14:36] (03PS2) 10RobH: new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) [20:15:05] (03CR) 10RobH: [C: 032] new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) (owner: 10RobH) [20:16:28] in hive you can set some configs to cause hive to collapse files together. Only really useful if it's a script that runs regularly or something though.
See lines 29-37: https://gerrit.wikimedia.org/r/#/c/317019/12/oozie/query_clicks/daily/query_clicks_daily.hql [20:16:32] !log Changing MTUs on interfaces to NTT [20:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:21] oh ebernhardson [20:17:22] -- Enable result file merging [20:17:22] set hive.merge.mapfiles=true; [20:17:22] set hive.merge.mapredfiles=true; [20:17:24] didn't know about that! [20:17:25] really? [20:17:28] that's cool [20:17:32] joal: ^^^ did you know about that? [20:17:55] Jeff_Green: not sure i understand your question [20:17:56] but just FYI [20:18:15] /mnt/hdfs is just a read-only mount to the HDFS file system, so that you can cd, ls, etc. into it [20:18:34] for your case it's probably fine to copy files directly from it, since I think your data is relatively small(?), but if it isn't [20:18:40] it'd be better to get it using the hdfs dfs CLI [20:18:51] I've been working from the example you gave me earlier, https://gerrit.wikimedia.org/r/#/c/327003/1/oozie/webrequest/legacy_tsvs/generate_sampled-1000_tsv.hql -- in that query there's a local output directory specified [20:18:56] hdfs dfs -ls, hdfs dfs -get, hdfs dfs -copyToLocal, -text, etc.
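(Editorial aside: the merge step discussed above — pulling Hive's many part files together into a single local TSV, which `hdfs dfs -text /path/to/files/in/hdfs/* > localfile` does when the output is in HDFS — can be sketched as below for output written to a local directory. The function name and paths are hypothetical; the `.crc` filter matches the checksum files mentioned later in the conversation.)

```python
import glob
import os

def merge_part_files(output_dir: str, dest_path: str) -> int:
    """Concatenate Hive's part output files from output_dir into one local
    file at dest_path, skipping the .crc checksum files written alongside
    them. Returns the number of part files merged."""
    parts = sorted(p for p in glob.glob(os.path.join(output_dir, "*"))
                   if not p.endswith(".crc") and os.path.isfile(p))
    with open(dest_path, "wb") as dest:
        for part in parts:
            with open(part, "rb") as src:
                dest.write(src.read())
    return len(parts)
```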
[20:19:17] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695031 (10RobH) [20:19:30] ya, whatever you put for directory there is an HDFS directory [20:19:52] the landingpages stuff is a few GB for the ~8 days in question, the /beacon/impressions stuff (which we normally sample 1:10) is much larger I think [20:20:04] so yes, should be in HDFS somewhere, (and available in the /mnt/hdfs mount) [20:20:32] k, might be faster / safer (/mnt/hdfs *usually* works, but not always) to use hdfs CLI [20:20:34] hdfs dfs [20:21:26] ottomata: yea it seems to work, result files usually end up somewhere between 128M and 256M [20:21:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 42 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:21:52] ottomata: hdfs cli as opposed to using hive at all? [20:23:11] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 115 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:23:48] Jeff_Green: no no, to copy the files that Hive creates out of HDFS [20:24:06] won't those files be hard to parse, as they will be in some hadoop format like parquet?
[20:24:19] Jeff_Green: is generating TSVs using Hive :p [20:24:22] oh [20:24:28] then it might work :) [20:24:29] https://gerrit.wikimedia.org/r/#/c/327003/1/oozie/webrequest/legacy_tsvs/generate_sampled-1000_tsv.hql [20:24:32] something like that ^ [20:24:41] Jeff_Green: once Hive is done [20:24:46] you'll have multiple files in HDFS [20:25:01] to get single files, probably easiest to just cat them together using hdfs cli into a local file [20:25:03] something like [20:25:06] right, they look like tsvs in the output format specified and there's [20:25:26] hdfs dfs -text /path/to/hive/output/in/hdfs/* > localfile.tsv [20:25:29] there are also .crc files [20:25:34] oh ya [20:25:38] wait [20:25:41] in hdfs? [20:25:58] I don't know :-) this is what I'm trying to understand [20:26:32] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:27:11] that example you posted above writes a whole lot of output files in whatever dir you specify, for the landingpages query there are 31K output files, half of which are *.crc [20:29:11] each one of the non-crc files contains a batch of logs in the TSV output format specified in the CONCAT() in that query [20:33:11] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:33:24] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3695061 (10RobH) @Cervisiarius: I'm also on clinic duty this week, so you should feel free to ping for assistance via IRC. Sometimes it is easier...
[20:33:57] ottomata: you can see the queries I'm running in stat1005:/home/jgreen/query_notes.txt and the output of the first is on disk already in /home/jgreen/landing [20:34:12] that said I gotta change venues, I'll be back in ~15 [20:42:16] OH [20:42:19] LOCAL DIRECTORY [20:42:20] huh. [20:42:21] cool [20:42:31] then you don't need hdfs dfs CLI [20:42:36] ok, understand why you have .crcs now [20:42:37] cool [20:42:45] you don't really need those, you can probably just rm them when it's done [20:43:03] UHHH [20:43:30] AHH Jeff_Green come back! [20:43:33] you are doing a crazy thing [20:55:46] ottomata: still around? [20:56:26] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695131 (10RobH) [21:03:33] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695158 (10RobH) a:05RobH>03None [21:04:01] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (10RobH) This is now ready for someone in #traffic to take over for service implementation/replacement of the existing lvs400[1-4]. [21:04:03] Jeff_Green: [21:04:03] ya [21:04:06] you are doing some crazy stuff! [21:04:07] haha [21:04:16] you still have TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) [21:04:19] you want that? [21:04:20] :-$ [21:04:56] yeah, we sample 1:10 for /beacon/impressions specifically, i think it was a preposterous amount of data otherwise and we don't really need that level of precision [21:05:02] ohhh ok [21:05:06] sorry maybe not so crazy then [21:05:07] we do that now at the kafkatee pipe [21:05:11] ahhh [21:05:11] ok cool [21:05:23] ok so i see what you are doing then, this looks fine.
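(Editorial aside: the TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) clause discussed above keeps roughly one row in ten at random. A minimal Python analogue of that kind of 1:10 row sampling — the function name is made up for illustration, and Hive's bucket sampling on rand() is only approximately this per-row coin flip:)

```python
import random

def sample_one_in_ten(lines, seed=None):
    """Keep each line with probability 1/10, mimicking the effect of
    Hive's TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) row sampling."""
    rng = random.Random(seed)
    return [line for line in lines if rng.randrange(10) == 0]
```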
LOCAL DIRECTORY makes sense [21:05:24] also when I ran it without that the output was >150GB :-) [21:05:31] and i see why you have .crcs then [21:05:34] you won't need those [21:05:42] you can delete them after hive is done [21:05:50] but ya, you've written this to local FS, not HDFS [21:05:53] so you don't need hdfs cli [21:06:01] you can do whatever you want to cat the files together [21:07:58] I'm a little puzzled what the output is, in the local directory. Here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries they talk about statements "to HDFS filesystem" being the most efficient, I'm pretty sure that's ~not~ what's happening specifying a directory in my homedir, but not positive [21:08:03] (03PS3) 10Dzahn: Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [21:09:32] 10Operations, 10ops-ulsfo, 10Traffic: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3695169 (10RobH) [21:09:36] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (10RobH) [21:09:38] 10Operations, 10ops-ulsfo, 10Traffic: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3695169 (10RobH) [21:10:38] Jeff_Green: that's right [21:10:41] by doing LOCAL DIRECTORY [21:10:43] you are not using HDFS [21:10:48] ok [21:11:05] so that means that every process across the cluster that would have written data to the local disk where they run [21:11:11] instead is beaming the data to your local client [21:11:12] (03CR) 10Dzahn: [C: 032] Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [21:11:18] and writing to your local disk [21:11:25] ok [21:11:35] you would need to pull it down to your homedir eventually anyway [21:11:36]
so this is fine [21:12:04] when that's done I'll just rsync them to americum and parse/sort/split into 15 minute increment logfiles to clobber the bad ones [21:12:28] meanwhile kafkatee is still running clean since 10/4 [21:12:59] and casey is working on packaging & testing librdkafka 0.9.6 and the latest kafkatee version [21:13:00] great [21:13:14] Jeff_Green: you have to split into 15 min logfiles? why? [21:13:15] just curious [21:14:29] we did that historically to keep the reporting lag low, I can probably backfill in larger chunks but it's easy enough to automate [21:14:50] (03PS1) 10Odder: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) [21:15:28] I'm not sure yet what they have to do at the parser to rerun that timeframe... [21:16:17] did you see I was finally able to pinpoint the recovery time? [21:20:31] Jeff_Green: yeah but it didn't correlate with anything, right? [21:22:05] not that I've found, no [21:22:43] also the fact that we saw loss after that window with kafkacat on alnitak doesn't bode well for an isolated event [21:23:21] woot. rsync is working [21:24:42] no_justification hi, with your patch, it will use the scap deployed version which is 2.13.9 compared to the one we are currently running. [21:24:48] do we need to restart gerrit [21:24:49] ? [21:25:16] Eh, we should upgrade anyway [21:25:22] ok thanks [21:25:47] mutante ^^ [21:30:03] no_justification: so you want to do the upgrade right now?
i can submit it [21:30:12] i am also on a bus, but it's fine :) [21:30:17] I'll do the upgrade first, uno momento [21:30:21] ok [21:32:58] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:32:58] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:33:52] Gerrit's back [21:33:58] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:33:58] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [21:34:11] https://gerrit.wikimedia.org/r/ [21:34:12] 503 [21:34:14] * mutante submits https://gerrit.wikimedia.org/r/#/c/384760/ [21:34:24] permission errors? [21:34:25] Press f5 ;-) [21:34:33] oh [21:34:34] works [21:34:46] thanks :) [21:34:53] no_justification: your change is on puppetmaster now.. but i'm not on cobalt itself [21:35:23] well. to be correct.. now it is really done [21:35:48] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:36:41] Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/bin/gerrit.war]/ensure: ensure changed 'file' to 'link' [21:36:42] Wheee [21:37:01] yea, so it didn't have an issue linking over the existing file [21:37:06] :) [21:37:07] but after an "unlink" the original file would be gone [21:37:15] windows users will like the fix for inline edit :). [21:37:24] paladox: hehe [21:37:27] :) [21:37:39] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures.
Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [21:38:18] nice, now you can actually remove the old package setup [21:38:20] right [21:38:28] yeh [21:38:44] Yeah. Basically that's simple: remove the package from the debian repo, archive (actually, probably just delete) the git repos [21:39:02] Oh, remove from puppet manifest first I s'pose :) [21:39:13] heh :) [21:39:16] :) [21:39:17] * paladox submits change [21:39:25] my bus is arriving, be back soon [21:39:55] (03PS1) 10Chad: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 [21:39:57] (03CR) 10Chad: [C: 032] group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:40:09] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:42:09] (03Draft1) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:42:14] (03PS2) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:42:19] no_justification mutante ^^ :) [21:42:19] (03Merged) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:42:21] ok [21:42:35] (03CR) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:42:44] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:43:28] (03PS3) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 
10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:44:09] (03Draft1) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:44:12] (03PS2) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:44:21] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:44:45] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [21:44:55] sorry for spam. [21:45:00] (03PS4) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:45:09] (03PS3) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:45:40] * paladox cherry picks [21:46:45] works [21:46:51] no puppet errors [21:47:25] (03CR) 10Chad: Gerrit: Remove gerrit package from apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:48:11] (03CR) 10Chad: [C: 031] Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [21:48:31] (03PS5) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:48:42] (03CR) 10Paladox: Gerrit: Remove gerrit package from apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:49:00] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) 
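The puppet notice earlier in the log ("ensure changed 'file' to 'link'" for gerrit.war), and the follow-up point that after an "unlink" the original file would be gone, come down to ordinary symlink-replacement semantics. A stand-in reproduction (paths here are placeholders, not the real cobalt paths):

```shell
set -e
tmp=$(mktemp -d)

# Stand-ins for the deb-installed war and the scap-deployed war.
echo "old packaged gerrit.war" > "$tmp/gerrit.war"
echo "scap-deployed gerrit.war" > "$tmp/scap-gerrit.war"

# ln -sf replaces the existing regular file with a symlink, which is what
# puppet's ensure 'file' -> 'link' change did. The old file's contents are
# gone after this, so removing the link later won't bring them back.
ln -sf "$tmp/scap-gerrit.war" "$tmp/gerrit.war"

readlink "$tmp/gerrit.war"
cat "$tmp/gerrit.war"   # reads through the link to the scap-deployed copy
```

This is why the channel notes it "didn't have an issue linking over the existing file": the replacement is atomic from puppet's point of view, but destructive for the original file.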
[21:49:37] (03PS6) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:58:26] (03PS1) 10ArielGlenn: recompress bz2 files in batches not restricted to subjobs [dumps] - 10https://gerrit.wikimedia.org/r/385104 [21:58:52] (03PS3) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [21:59:00] (03PS4) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [22:01:13] (03Draft1) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [22:01:16] (03PS2) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [22:05:08] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:05:47] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:06:21] paladox: back, what's up with "missing group" [22:06:28] looks [22:06:32] nothing, just inconsistent :) [22:06:42] all the other code has group => 'gerrit2' [22:06:50] ok [22:06:59] :). [22:07:38] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:08:01] paladox: The change could not be rebased due to a conflict during merge. [22:08:10] ah the group one?
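The "missing group" change being discussed (https://gerrit.wikimedia.org/r/385101) is small enough to sketch. The log only shows that the other resources in the module carry group => 'gerrit2'; everything else in this resource (ensure, mode) is an assumption, not the actual operations/puppet manifest:

```puppet
# Sketch only: give /var/lib/gerrit2 an explicit group so it matches the
# rest of the gerrit module, which already uses group => 'gerrit2'.
file { '/var/lib/gerrit2':
  ensure => directory,
  owner  => 'gerrit2',
  group  => 'gerrit2',
  mode   => '0755',
}
```

As Dzahn notes, the change is a no-op on cobalt where the directory already has that group; the resource just pins the existing state.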
[22:08:13] yep [22:08:14] that one is dependent [22:08:20] on https://gerrit.wikimedia.org/r/385100 [22:08:59] technically yea, but code-wise, no [22:09:13] but not important [22:09:22] it can go after the other one then [22:10:12] (03PS1) 10MaxSem: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) [22:10:45] paladox: did you try 385100 yet? i mean the actual puppet run with the new "require" lines [22:10:51] yes [22:10:54] :) [22:10:58] and you actually have scap on it too [22:11:23] yep [22:11:28] (03PS7) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [22:11:30] (03PS4) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [22:12:02] rebased. [22:14:17] (03CR) 10Dzahn: [C: 032] "yea, it's like that on cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [22:14:34] "submit incl.
parents" [22:14:52] :) [22:14:55] but dont worry too much, we can do both [22:14:58] just compiling it [22:15:04] (03CR) 10Gergő Tisza: [C: 032] Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:15:11] ok thanks :) [22:16:11] (03Merged) 10jenkins-bot: Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:16:35] (03CR) 10Zoranzoki21: [C: 031] Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [22:16:37] (03CR) 10jenkins-bot: Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:17:54] (03PS8) 10Dzahn: Gerrit: stop installing deb package, replace with scap requires [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:18:14] i am just renaming it slightly because "removing it from APT" wasnt really what this does [22:18:19] thanks [22:18:23] that would be like running reprepro commands on apt.wm [22:19:05] heh [22:19:51] 10Operations, 10MediaWiki-General-or-Unknown, 10RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10MaxSem) [22:22:20] 10Operations, 10MediaWiki-General-or-Unknown, 10RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695361 (10Jdforrester-WMF) [22:22:34] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/8369/" [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:22:38] (03CR) 10Dzahn: [C: 032] Gerrit: stop installing deb package, replace with scap requires [puppet] - 
10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:22:43] thanks :) [22:24:29] on gerrit2001 - Service[gerrit]/ensure: ensure changed 'stopped' to 'running' [22:24:59] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695375 (10Legoktm) [22:25:02] heh [22:25:08] i presume because of the db thing [22:25:12] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695377 (10Jdforrester-WMF) [22:25:13] it probably timed out [22:26:06] on cobalt: no-op [22:26:06] done [22:26:14] :) [22:26:29] (03PS5) 10Dzahn: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [22:26:31] package now removed from puppet code no_justification :) [22:26:38] repo can be archived now. [22:27:04] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10Jdforrester-WMF) Possible tasks blocking running MediaWiki in production in 5.6-compatible environments: * Convert `mwscript` hosts to run on Zend 5.6, not 5... [22:27:15] i also want to move something from an operations/deb repo to a non-deb repo but i want to keep my history... [22:27:37] is now cloning from the repo in "debs" hierarchy without actually building it [22:27:45] git push --mirror [22:27:46] ?
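The `git push --mirror` suggestion for moving a repo out of the "debs" hierarchy while keeping its history can be demonstrated with throwaway local repos (the paths below are stand-ins for the real Gerrit repos):

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the old operations/debs repo, with some history worth keeping.
git init -q "$tmp/old-debs-repo"
git -C "$tmp/old-debs-repo" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "packaging history worth keeping"

# Stand-in for the new, empty destination repo.
git init -q --bare "$tmp/new-repo.git"

# --mirror pushes every ref (branches, tags, notes), so the full history
# moves over; no re-import or history rewrite is needed.
git -C "$tmp/old-debs-repo" push -q --mirror "$tmp/new-repo.git"

git -C "$tmp/new-repo.git" log --all --format=%s
```

One caveat worth hedging: `--mirror` also deletes destination refs that don't exist in the source, so it should only be pointed at a freshly created target repo. The StackOverflow link in the channel covers the harder case of extracting a single file's history rather than the whole repo.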
[22:27:51] !log tgr@tin Synchronized wmf-config/: T174651 / gerrit 384908 - deploy ReadingLists to beta - noop for production (duration: 00m 53s) [22:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:59] T174651: Beta testing of the ReadingLists extension - https://phabricator.wikimedia.org/T174651 [22:28:10] paladox: oh, looking [22:28:15] :) [22:28:31] or there's https://stackoverflow.com/questions/44777043/git-copy-history-of-file-from-one-repository-to-another [22:28:37] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695387 (10ArielGlenn) I plan to have the dumps hosts upgraded long before this deadline approaches. [22:28:46] would it be nice to keep the gerrit history ? [22:29:13] oh, i don't know if that would be supported [22:31:27] i merged your other change but it wasnt on channel? [22:31:55] hmm, ah i see because you merged it. [22:32:04] if jenkins did it, it would show :) [22:32:10] thanks [22:32:34] paladox: yea, so on gerrit2001, it times out like you said [22:32:39] yep [22:32:40] after a while puppet will start it again [22:32:45] what triggers scap on the beta cluster?
[22:32:48] yeh [22:32:52] i18n data generation, specifically [22:33:05] tgr jenkins i think, though not sure about i18n [22:33:17] paladox: all operations/puppet changes are merged by humans though [22:33:26] nvm, looks like it did run, just was a little behind code updates [22:34:01] yep [22:36:48] (03CR) 10Putnik: [C: 031] Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [22:37:04] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695394 (10Jdforrester-WMF) [22:41:25] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3695401 (10RobH) [22:41:52] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563233 (10RobH) a:05RobH>03mark I've assigned this to @mark for feedback on the other scs consoles in esams. I'm guessing they are old cruft and no longer used? If so, this can be resolved. [22:42:25] (03PS2) 10ArielGlenn: recompress bz2 files in batches not restricted to subjobs [dumps] - 10https://gerrit.wikimedia.org/r/385104 [22:45:41] (03Abandoned) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:45:59] (03PS12) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [22:55:04] !log tgr@tin Synchronized private/: Remove the llama (T150554) (duration: 00m 50s) [22:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:53] paladox: yay, keep going. 
the older the better, heh [22:59:16] lol [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T2300). [23:00:04] Jdlrobson, twkozlowski, and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:33] * MaxSem waves [23:03:55] \o [23:03:56] * no_justification steals deploy conch [23:03:58] For a few [23:04:26] !log demon@tin Synchronized php: symlink swap (duration: 00m 49s) [23:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:29] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.4 [23:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:21] no_justification: so are you doing the swat window or is MaxSem ? Little confused with what's happening... [23:25:03] I had a meeting in the beginning [23:25:15] no_justification: are you done? 
[23:27:24] assuming done [23:28:44] (03PS2) 10MaxSem: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:28:48] (03CR) 10MaxSem: [C: 032] Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:30:00] (03Merged) 10jenkins-bot: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:30:15] (03CR) 10jenkins-bot: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:31:32] (03CR) 10Aaron Schulz: Allow specifying --group to sql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [23:31:52] jdlrobson: the config change is live on mwdebug1002 [23:31:59] MaxSem: sweet. testing [23:32:19] fixed! yay! [23:32:56] MaxSem: you can sync [23:34:19] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/385007/ (duration: 00m 51s) [23:34:22] jdlrobson: ^ [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:01] hurrah. one down [23:36:00] jdlrobson: pulled the Collection change for both branches [23:36:09] MaxSem: hurrah. should be quick to test [23:38:01] next is Odder's change but I don't see him around [23:39:50] anybody familiar with WB logos to check this over? [23:41:08] MaxSem: please ping when collection changes are synced. [23:41:21] jdlrobson: waiting for an ok from you [23:41:48] MaxSem: oh didn't realise. Should I be testing on debug1002?
they look good there :) [23:41:55] as always [23:42:15] MaxSem: thx [23:42:27] ok, deploying [23:43:15] Why IRC pings aren't actually pinging me today.... [23:44:14] !log maxsem@tin Synchronized php-1.31.0-wmf.3/extensions/Collection/: https://gerrit.wikimedia.org/r/#/c/385005/ (duration: 00m 51s) [23:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:30] thanks MaxSem !!! :) [23:45:40] !log maxsem@tin Synchronized php-1.31.0-wmf.4/extensions/Collection/: https://gerrit.wikimedia.org/r/#/c/384903/ (duration: 00m 50s) [23:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:03] (03PS2) 10MaxSem: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:46:07] (03CR) 10MaxSem: [C: 032] Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:06] (03PS8) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:47:23] (03Merged) 10jenkins-bot: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:33] (03CR) 10jenkins-bot: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:38] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 (owner: 10Dzahn) [23:49:05] !log maxsem@tin Synchronized static/images/project-logos/: https://gerrit.wikimedia.org/r/#/c/385094/2 (duration: 00m 50s) [23:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:47] !log
maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/385094/2 (duration: 00m 50s) [23:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:11] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3695536 (10RobH) [23:53:13] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3695537 (10RobH) [23:53:54] (03PS2) 10MaxSem: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) [23:54:00] (03CR) 10MaxSem: [C: 032] Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [23:55:11] (03Merged) 10jenkins-bot: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [23:56:13] (03CR) 10jenkins-bot: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem)